Averages Are Evil

The basis of science, and one of its main claims to epistemic validity, is observation and measurement. We do experiments and follow evidence. We enquire into the workings of the world by means of data.

Unfortunately, data is not always well-behaved. It is tricksy and wayward and noisy, subject to contamination and confounding and being not what it seems. The more complex the processes being studied, the more data is needed and the more sources of error there are. And most life science processes are very complex indeed.

Over the years statisticians have come up with a wide variety of techniques for wrestling useful information out of noisy data, ranging from the straightforward to the eye-wateringly complicated. But the best-known and most widely used, even by non-scientists, is much older and simpler still: the average, usually in the form of the arithmetic mean.

Averaging is easy: add up all your data points and divide by how many there were. Formal notation is frankly superfluous, but I’ve got a MathJax and I’m gonna use it, so for data points $x_1$, $x_2$, …, $x_n$:

$$\bar{x} = \frac{1}{n} \sum^n_{i=1} x_i$$

The intuition behind averaging is as appealingly straightforward as the calculation: we’re amortising the errors over the measurements. Sometimes we’ll have measured too high, sometimes too low, and we hope that it will roughly balance out.
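As a rough sketch of that intuition, here is a toy simulation in Python. Everything in it is made up for illustration: the "true" value, the noise level and the number of measurements are all assumptions, and the noise is deliberately symmetric (Gaussian).

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 10.0  # hypothetical quantity we are trying to measure

# Each measurement is the true value plus symmetric (Gaussian) noise.
measurements = [TRUE_VALUE + random.gauss(0, 1.0) for _ in range(1000)]

mean = statistics.fmean(measurements)
worst_single_error = max(abs(m - TRUE_VALUE) for m in measurements)

print(f"mean of measurements: {mean:.3f}")
print(f"error of the mean:    {abs(mean - TRUE_VALUE):.3f}")
print(f"worst single error:   {worst_single_error:.3f}")
```

Under these assumptions the mean should land far closer to the true value than the worst individual measurement does, which is the whole point of the exercise.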

Because it’s easy to understand and easy to use, averaging is used a lot. I mean A LOT. All the time, for everything. Which is often fine, because it’s actually a pretty useful statistic when applied in an appropriate context. But often it’s a horrible mistake, the sort of thing that buttresses false hypotheses and leads to stupid wrong conclusions. Often it is doing literally the opposite of what the user actually wants, which is to reveal what is going on in the data. Averages can all too easily hide that instead.

Obviously this is not really the fault of the poor old mean. It’s the fault of the scientist who isn’t thinking correctly about what their analysis is actually doing. But averaging is so universal, so ubiquitous, that people just take it for granted without much pause for thought.

The fundamental problem with averaging is, vexingly, also the thing that makes it appealing: it reduces a potentially complex set of data into something much simpler. Complex data sets are difficult to understand, so reduction is often desirable. But in the process a lot of information gets thrown away. Whether or not that’s an issue depends very much on what the data is.

In the simple error model described above, the extra data really is just noise — errors in the measurement process that obscure the single true value that we wish to know. This is the ideal use case for the mean, its whole raison d’être. Provided our error distribution is symmetric — which is to say, we’re about equally likely to get errors either way — we will probably end up with a reasonable estimate of the truth by taking the mean of a bunch of measurements. We don’t really care about the stuff we’re throwing away.

Histogram of a unimodal data set.
For unimodal data, the mean (indicated here by the dashed line) can give a reasonable estimate of the typical (or even “true”) value in the population.

However, this is a very specific kind of problem, and many (perhaps most) sets of data that we might be interested in aren’t like that. It’s actually pretty rare to be looking for a single true value in a data set, because most realistic populations are diverse. If the distribution of the data is not unimodal (that is, not clustered around one central value) then the average is going to mislead.

Histogram of non-unimodal data
When the data is not unimodal, the mean value (dashed line) may be completely unrepresentative.
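A minimal sketch of the same effect, using a small hand-made sample rather than the data behind the figure (the two clusters and their values are invented purely for illustration):

```python
import statistics

# Hypothetical, hand-made bimodal sample: one cluster near 2, another near 10.
low_cluster = [1.8, 1.9, 2.0, 2.0, 2.1, 2.2]
high_cluster = [9.8, 9.9, 10.0, 10.0, 10.1, 10.2]
data = low_cluster + high_cluster

mean = statistics.fmean(data)

# How far is the closest actual observation from the mean?
nearest_gap = min(abs(x - mean) for x in data)

print(f"mean: {mean:.2f}")
print(f"distance from mean to nearest data point: {nearest_gap:.2f}")
```

The mean falls in the empty gap between the clusters, nowhere near any actual observation. It describes a member of the population that doesn’t exist.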

What is the average human height or weight? That seems like a plausible use of the mean, but it’s barely even a meaningful question. A malnourished premature newborn and the world’s tallest adult simply aren’t commensurate. It’s like taking the average of a lightbulb and a school bus. The result tells you nothing useful about either.

This problem is significantly compounded when you start wanting to compare data sets. Which is something we always want to do.

You can, of course, compare two means. One of the most basic and widely used statistical tests, Student’s t-test, will do exactly that. But to do so is explicitly to assert that those means do indeed capture what you want to compare. In a diverse population (and again, most realistic populations are diverse) that is a strong assumption, one that needs to be justified with evidence.

Let’s say you’ve got two sets of observations. The sets are related in some way — they might be from patients before and after a treatment, or children before and after a year of schooling, or shoes worn on left and right feet. The raw data look like this:

Raw Data

You’re looking for a difference — a change, let’s call it an improvement — between these two sets, so you take the means:

Mean values of each group

Well, that’s kind of promising: observation 2 definitely looks better. Maybe you draw a line from one to the other to emphasise the change. Obviously there are some differences across the population, so you throw in an error bar to show the spread:

Difference of means with SD error bar

Here I’ve shown the standard deviation, a common and (given some distributional assumptions) useful measure of the variability in a data set. Very often people will instead use a different measure, the standard error of the mean (often shortened to standard error). This is a terrible practice that should be ruthlessly stamped out, but everyone keeps on doing it because it makes their error bars smaller:

Mean change with SEM
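To see why the bars shrink, recall that the standard error of the mean is the sample standard deviation divided by $\sqrt{n}$. The following is a small sketch with made-up numbers; the helper name `sd_and_sem` and the sample values are my own inventions for illustration, and repeating the same sample is just a crude stand-in for collecting more data of the same kind.

```python
import math
import statistics

def sd_and_sem(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)        # spread of the data themselves
    sem = sd / math.sqrt(len(values))    # uncertainty in the estimated mean
    return sd, sem

# Hypothetical sample, repeated to mimic collecting more of the same kind of data.
sample = [4.0, 5.0, 6.0, 7.0, 8.0]
for copies in (1, 4, 16):
    values = sample * copies
    sd, sem = sd_and_sem(values)
    print(f"n={len(values):3d}  SD={sd:.3f}  SEM={sem:.3f}")
```

The standard deviation stays roughly where it was as $n$ grows, because the population is no less variable than before. The SEM keeps shrinking, because it describes how precisely the mean has been pinned down, not how spread out the data are.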

While you’re at it, you might perform the aforementioned t-test on the data and boldly assert there’s a less than 5% probability* the observed improvement could have happened by chance. Huzzah! Write your Nature paper, file your patent, prepare to get rich.

But what we’ve done here is gather a bunch of data — maybe through years of tedious and costly experiments — and then throw most of it away. Some such compression is inevitable when data sets are large, but it needs to be done judiciously. In this case the sets are not actually large — and I’ve concocted them to make a point — so let’s claw back that discarded information and take another look.

In using the means to assess the improvement we implicitly assumed the population changes were homogeneous. If instead we look at all the changes individually, a different picture emerges:

Pairing data between observations

It’s pretty clear that not everyone is responding the same way. Our ostensible improvement is far from universal. In fact there are two radically different subsets in this data:

Paired data with group colouring

Fully half of the subjects are significantly worse off after treatment — and in fact they were the ones who were doing best to begin with. That’s something we’d really better investigate before marketing our product.
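Here is a minimal sketch of that kind of check. The numbers are hypothetical and are not the data plotted above; they are just arranged to have the same shape, with paired before and after values for each subject.

```python
import statistics

# Hypothetical paired observations (same subject at the same index).
before = [2.0, 2.5, 3.0, 3.5, 7.0, 7.5, 8.0, 8.5]
after = [7.0, 7.5, 8.0, 8.5, 5.5, 6.0, 6.5, 7.0]

# Per-subject change, which is the information the two means throw away.
changes = [a - b for b, a in zip(before, after)]

improved = [c for c in changes if c > 0]
worsened = [c for c in changes if c < 0]

print(f"mean before: {statistics.fmean(before):.2f}")
print(f"mean after:  {statistics.fmean(after):.2f}")
print(f"mean change overall: {statistics.fmean(changes):+.2f}")
print(f"improved: {len(improved)} subjects, mean change {statistics.fmean(improved):+.2f}")
print(f"worsened: {len(worsened)} subjects, mean change {statistics.fmean(worsened):+.2f}")
```

The overall mean goes up, yet half of these invented subjects get worse, and they are the ones who started highest. Splitting by the sign of the change is about the crudest grouping possible, but it is enough to show that the single summary number was hiding two opposite stories.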

If this were real data, we would want to know what distinguishes the two groups. Is it the mere fact of having a high initial level of whatever it is we’re measuring? Is there some other obvious distinction like sex or smoking? Some specific disease state? Or is there a complex interplay of physiological and social factors that leads to the different outcome? There might be an easy answer, or it might be completely intractable.

Of course, it’s not real data, so the question is meaningless. But the general shape of the problem is not just an idle fiction. It’s endemic. This is rudimentary data analysis stuff, tip of the iceberg, Stats 101, and people get it wrong all the sodding time. They’re looking at populations they know are drastically heterogeneous, but they can scrape a significant p-value by comparing means and that’s all that matters.

Stop it. Don’t be that person. Don’t toss your data away. Recognise its structure. Plot it all. Don’t hide its skew. Don’t make unwarranted assumptions. And don’t take an average unless you actually fucking mean it.


* I’ll save the rant about significance tests for another time.

Fear of Music

On a possibly happier note, there’s this:

Some other SoundCloud “albums” passed unreported here, but this one entertains me for some reason and is the first in ages to make it to BandCamp — rest assured as always that no-one is expected to buy the fucking thing, BC just represents a different grouping mechanism.

And yes, this is my third post in three days. Some kind of inertial shift? No promises whatsoever.

Plate Tectonics

I happened to be reading Guy Gavriel Kay’s latest in June, which seemed grimly apt. A common pattern in several of his novels — though perhaps less in Children of Earth and Sky, as it turns out — is of events accumulating into a huge societal transformation, whose enormity is not apparent until afterwards. Errors of judgement, missed opportunities, subtly shifting political alliances and conflicts of interest conspire to bring about the end of a golden age, destroy a fragile civilisation, harden and coarsen and entrench attitudes in a wearied population. Life goes on — what else should it do? — just a bit less well. Only in retrospect do we perceive the knife edge on which it was so finely balanced.

These are fictions, of course, heightened and romanticised, and who knows how well they capture anything of the real historical moments on which Kay draws — the fall of the Tang and Northern Song, Byzantium and al-Andalus. An appeal to a Golden Age is always dangerous. Nostalgia corrupts; just look around us now.

Still, it certainly feels like one of those moments. A collective surrender to the imp of the perverse; a yearning for things to be made worse. And it’s not over yet, of course.

I’ve intermittently felt the urge to write about it, but it’s been difficult to summon much enthusiasm for blogging amidst the ruins, the trembling ground, the overwhelming sense of unmooring. A part of my identity is leaching away.

In the poisoned discourse of 2016, we are told that it is arrogant and elitist and anti-democratic to complain about the abrogation of a whole population’s rights at the whim of a narrow majority of actively-deceived voters, to rail against the generational betrayal of the young by delusional elderly racists, or to point out that contradictory goals do not suddenly become reconcilable just because an uneasy coalition of people with opposing aims all declared a desire for their own particular fantasy of not the status quo.

We voted for magic. Now bring me my unicorn!

Well, fuck that shit. Fuck the vanity of uninformed opinion, the false equivalence of visceral prejudice with expertise, the active disdain for reality. Fuck the shameless lies and pandering of nauseating hucksters like Gove and Johnson, peddling random policy baubles and then backing away with an insouciant shrug. Fuck the sociopathic (and ongoing) rabble-rousing of haterags like the Express and Mail and the cowed pseudo-balance of the BBC. Fuck the insistence of the ignorant that their vapid views be listened to and taken seriously.

That seething mass of mutually-incompatible twattery who make up the 52% are wrong and their misexpression of misdesire deserves no fucking respect at all. Literally every single reason for voting Leave boils down to one or both of evil and stupid.

I am prepared to accept that most of those people are not evil.

 

Aurora

“[…] People do seem to get addicted to their resentments. It must be like an endorphin, or a brain action in a temporal region, near the religious and epileptic nodes. I read a paper saying as much.”

“Fine for you, but let’s stick to the problem at hand. People feeling resentment are not going to give up on it when they are told they are drug addicts enjoying a religious seizure.”