Covid-19 Trends — Determining Them Automatically

James Reade
4 min read · Sep 28, 2020


Our world produces a huge amount of data. When one has skills in handling and manipulating data, it’s tempting to get involved in trying to understand all sorts of data series.

Yet data series have underlying causal structures. Wading into another field’s territory leaves you wide open to all sorts of accusations. So I’m writing this tentatively, as I’m not an epidemiologist, nor a biologist.

But I have done a lot of work looking at trends in time series data over the years, including helping in the development of a package that attempts to automatically detect trends — gets, in R.

As we’ve headed into a second wave of Covid-19, I’ve been inspired to look at NHS hospitalisation data — not least by the kinds of dismissals of second-wave worries commonly seen on Twitter: it’s a “casedemic”, only cases, nothing else.

Ploughing a seemingly lonely furrow is University of Bristol mathematician Oliver Johnson, who has presented the data regularly since the first wave. At first he pushed back, with data, against those who were overly worried when cases and deaths didn’t fall as fast as hoped. Now, with data, he is pointing out where things are going.

It’s all about trends. Viruses spread exponentially, person to person, so if logs of the data are taken, such trends become linear. A steeper straight line means a larger daily percentage increase (or decline).
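The article’s own analysis is in R, but the log-linear point is easy to illustrate with a small Python sketch on simulated data (my own illustration, not the article’s code):

```python
import math

# Simulated exponential growth: a 7% daily rate from 100 admissions.
rate = 0.07
series = [100 * math.exp(rate * t) for t in range(30)]

# Taking logs turns the exponential curve into a straight line.
logs = [math.log(y) for y in series]

# Successive differences of the logged series are constant, and equal
# to the daily growth rate: the slope of the straight line.
diffs = [logs[t + 1] - logs[t] for t in range(29)]
print(round(diffs[0], 4), round(diffs[-1], 4))  # both print as 0.07
```

A steeper slope on the log scale therefore translates directly into a faster daily percentage change.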

Here’s the normal data:

Here’s the logged version:

That green line is a set of split trends (trends that change during the sample). I didn’t draw it on, or make up the dates. Once the break dates are determined, the slopes are determined.
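A split trend is just a regressor that is flat up to a break date and then counts upwards; once the break date is known, ordinary least squares pins down the slopes on either side. Here is a hedged Python sketch with the break date assumed known (finding it is what gets does), on noise-free simulated data:

```python
import numpy as np

# Simulated logged series with a trend break at t = 20:
# slope 0.12 up to the break, then slope -0.05 afterwards.
t = np.arange(40)
break_at = 20
y = np.where(t < break_at, 0.12 * t, 0.12 * break_at - 0.05 * (t - break_at))

# Design matrix: intercept, overall trend, and a "split trend" that is
# zero before the break and counts up after it.
split = np.maximum(t - break_at, 0)
X = np.column_stack([np.ones_like(t, dtype=float), t, split])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] is the pre-break slope; beta[1] + beta[2] is the post-break slope.
print(round(beta[1], 3), round(beta[1] + beta[2], 3))  # 0.12 and -0.05
```

The coefficient on the split-trend regressor is the *change* in slope at the break, which is exactly the quantity of interest when asking whether a measure bent the curve.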

I used gets, the R package, to detect these trends automatically, using a process called Indicator Saturation. What is that? Here’s a good explainer. It’s the process of saturating a model with indicator variables (dummies for outliers, step changes, trends, or things more exotic than that), doing so in batches to avoid perfect multicollinearity, retaining the significant indicators from each batch, then pooling them all into one final model and keeping those that remain significant.
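The batching-and-pooling idea can be sketched in a few lines of Python. This is a deliberately simplified toy (step indicators only, a fixed seed, a single hypothetical break, and a crude t-statistic cut-off); the real gets/isat machinery is far more careful about search paths and critical values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated series: flat at 1, then a step up to 6 at t = 60, plus noise.
n = 100
y = 1.0 + 5.0 * (np.arange(n) >= 60) + rng.normal(0, 0.1, n)

def step(j):
    """Step dummy: 0 before observation j, 1 from j onwards."""
    return (np.arange(n) >= j).astype(float)

def significant(cols, names, tcrit=4.0):
    """OLS of y on an intercept plus the given columns; return the
    names whose coefficients have |t| above tcrit."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return [nm for nm, t in zip(names, beta[1:] / se[1:]) if abs(t) > tcrit]

# Saturate with a step dummy at (almost) every date, split into two
# batches -- including them all at once would be perfectly collinear.
kept = []
for lo, hi in [(2, 51), (51, n)]:
    names = [f"step_{j}" for j in range(lo, hi)]
    kept += significant([step(j) for j in range(lo, hi)], names)

# Pool the survivors and keep those still significant in the final model.
final = significant([step(int(nm.split("_")[1])) for nm in kept], kept)
print(final)
```

With this seed the procedure should pick out the dummy at the true break date, t = 60, which is the flavour of what Indicator Saturation delivers on the hospitalisation series.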

It’s got pretty good properties. It’s not perfect, since no procedure is going to be. But it gives a good stab at which observations may be outliers, where structural changes may have happened, and where split trends occurred.

The latter point is important in the context of Covid-19. We want to know what the trend is (to “flatten the curve”), and we want to know when changes happened.

So in the picture above, the trend to April 2 was 12%, before turning negative, and it stayed negative for many months. Since about September 1 (there’s a funny kink about then) it’s been about 6.7%. That’s for England. For some parts of the country it’s been greater (North West, 8–9%), and for others lower (no trend at all in the East of England). But all these trends share the common feature of having started late August, early September. Here they all are:
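To put a daily growth rate like 6.7% in perspective, the implied log-slope and doubling time follow from a little arithmetic (my own back-of-envelope calculation, not a figure from the article):

```python
import math

daily_growth = 0.067  # roughly 6.7% per day, England since about Sep 1

# On a log scale this is a straight line with slope log(1 + daily_growth);
# the implied doubling time for admissions follows directly.
slope = math.log(1 + daily_growth)
doubling_days = math.log(2) / slope
print(round(slope, 3), round(doubling_days, 1))
```

At that rate, admissions double in a little under eleven days, which is why a seemingly modest daily percentage matters.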

And they are all remarkably stable linear trends (once the data are logged).

Hospitalisations are but one piece of a much bigger, much more complex picture than I can really hope to be able to comprehend myself, let alone present to others.

But I think Indicator Saturation has the potential to be important in determining when changes had an impact — especially given that measures in different parts of the country have varied over recent months.

Written by James Reade

Christian, husband to a wonderful wife, father of two beautiful children, Professor in Economics at the University of Reading. Also runs.
