Thursday, 5 June 2025

On data

While waiting for the right time to get up a couple of days ago, prompted by having been turning the pages of reference 1 at the Internet Archive the day before, I got to thinking about types of data. We suppose the sort of data you might find in the cells of an Excel worksheet. A worksheet where the rows correspond to the members of some population, some sample or other, and the columns correspond to the various properties of those members.

In the snap above we have such a worksheet, purporting to be a chunk of the record of purchases as seen from the Sainsbury’s regional headquarters in North Surrey. Not terribly realistic, but hopefully it serves the present purpose.

Noting that there is plenty of design, even in this very small worksheet. There are lots of ways to hold records of this sort and the trick of the design is to take good account of both the context and the requirements, that is to say the uses to which the records are to be put.

We suppose that the idea is to produce statistics from this worksheet by means of either a chart or a cross tabulation. Examples of both are included below, derived from the worksheet snapped above. Note that the Excel versions of cross tabulations are called pivot tables.

For present purposes, a chart might be a line graph or a histogram. The first of these shows some numeric value for every member of the population, the subjects, a graphic presentation of a column of the worksheet. In the example above, top right, we might want to run down the cause of the spikes in alert level. The second counts up the number of cases taking each of the values of some categorical property, perhaps the subject’s age band, perhaps the make of toaster in the subject’s kitchen. In the example above, top left, the product class.

While a cross-tabulation is a rectangular array of rows and columns in which the cells are either counts of records (for example, number of deaths) or sums of the values of properties (for example, the total monetary value of a number of orders). Noting that there can be cross-over here in that one can sum a binary property, which sum might, for example, the number of deaths where organophosphates were mentioned on the death certificate, this last being recorded as the binary, yes-no property being summed. Noting that in the examples above, ‘sequence’ is an unhelpful artefact of the pivot table where it means that we are counting qualifying rows of the worksheet.

Properties of properties

Excel allows you to format a cell, to display a cell on screen or on paper, according to the scheme snapped above. A scheme which is related to – but by no means the same – as what follows.

We suppose that the same set of properties is more or less applicable to all the members, to all the rows of the worksheet. That we don’t have a mixed population, say of houses and pets, where quite different properties might be used to describe the different categories. With the complication that both houses and pets might be regarded as possessions and be assigned properties from that point of view.

A property might be a real valued number, like height or weight. As it happens, two properties which are also visually or physically additive – in the sense that if you stick two wooden blocks together, the weight together is the sum of the two weights apart.

A property might be an integer, like number of legs or number of rooms. Usually there is no doubt about this number, although it is often possible to find or postulate marginal cases. When is a room just a big cupboard?

In both these cases we might take an interest in the mean value. Computation of which is often slightly complicated by the need to allow for missing values.

A property might be a category, like kind of pet, for example cat, dog or hamster. In this case there is no mean and there is not even a natural ordering of the categories, although it might be convenient to order them alphabetically, in which case the ordering is an accident of name, rather than reflecting anything more substantial.

On the other hand, it might be one of the Likert scales widely used when surveying people. A commonly used five-point Likert scale is: 

Strongly disagree

Disagree

Neither agree nor disagree

Agree

Strongly agree

A similarly short scale might be used to describe the hardness of minerals. In both these cases there is a natural order, although one difference is that in the first case there is a zero of sorts in the middle.

While if the property is a date, not only is there a natural order, but the difference between two dates is meaningful and can be expressed as a real number or an integer, in days, weeks or whatever. Sometimes the ratio between two values is meaningful, as it is, for example, in the case of height or weight: we can say that X is twice the weight of Y. But not in the case of date.

A property might be derived from a real number by a process sometimes called classification or binning. So, we might classify age into the age bands:

Under five

Five to under twenty

Twenty to under sixty

Sixty and over

Otherwise, an integer between 1 and 4. In an Excel worksheet, sometimes one might use the text description, sometimes the number, sometimes both. A classification which might be convenient from the point of view of producing cross tabulations – but which excludes the calculation of the mean. In which connection, note that a pivot table is apt to present text properties alphabetically, which is not which one wants. There is an example of this at reference 2.

Some properties, perhaps derived from classification of something more complicated, are binary; one or zero, success or failure – either the thing in question is a cat or it is not. In this case, it is meaningful to talk of the mean. And the binomial distribution is a helpful way to describe samples from such a population. While if we had coded 1 for cat, 2 for dog and 3 for hamster, the mean is not without meaning – but that meaning is not usually very interesting.

Going the other way, some properties are very complicated, perhaps organised hierarchically. Classifications of occupation and cause of death, for example, have soaked up a lot of quality time – not to mention paper – over the years.

A special case is the property which is an identifier or label. In this case no two members of the population, no two rows of the corresponding worksheet, share the same value.

Statistical matters

The mean value of a numeric property, for example height or number of children, is often of interest. In the case that our worksheet contains the whole of the population of interest, one just calculates the mean. In the case when it contains a sample, hopefully a random sample from that population, we calculate an estimate of the mean in the same way – but we then want to know how good an estimate that is likely to be.

Alternatively, we might have some prior information about what the mean should be, and we are interested in whether the mean of our sample differs significantly. Are significantly more errors getting into the system when Jonathan is on-shift than when Mavis is on-shift?

Quite often we can simplify this problem by assuming that the estimate comes from a normal population, where the familiar bell-shaped normal distribution is well known and well documented and we rapidly arrive at a p-value. Which is a whole new topic in its own right, left for another day. In the meantime, there is reference 3.

Alternatively, when the variables of interest are categorical, categorical to the extent of taking more than two values, or when we are interested in the joint behaviour or two or more variables, we are pushed us into cross tabulation, aka pivot tables, aka the contingency tables of reference 4. Here the motivation is to explore the extent to which the proportions in a table such as that snapped above are correlated. Does sex have a part in determining whether we are left or right handed? Or vice-versa?

Which is an important, but even more complicated problem.

Conclusions

An apparently banal topic hiding all kinds of complications, not least the statistical complications from where this digression sprang.

Reference 5 is a more serious contribution from before the time of Excel and its worksheets, turned up by Wikipedia and from where the classification of properties snapped above is taken.

PS: while reference 6 is a new to me variety of boreal wildfire, a variety which looks difficult to put a stop to.

References

Reference 1: Statistical Methods for Research Workers – R A Fisher – 1934

Reference 2: https://psmv5.blogspot.com/2025/04/statistical-overdrive.html

Reference 3: https://psmv5.blogspot.com/2021/11/the-p-value-battle-continues.html

Reference 4: https://en.wikipedia.org/wiki/Contingency_table. The source of the snap above.

Reference 5: On the Theory of Scales of Measurement - Stevens, S. S – 1946. 

Reference 6: ‘Zombie fires’ fuel intense Canadian blazes behind plumes spreading to Europe and US: Scientists are studying the phenomenon behind at least 49 active wildfires - Attracta Mooney, Steven Bernard, Jana Tauschinski, Financial Times – 2025.

No comments:

Post a Comment