More permanent stuff at http://www.dusbabek.org/~garyd

26 March 2012

Calculating Long Running Averages

Something I've been working on has me calculating averages over time on data arriving into a system.  I don't know beforehand how many pieces of data there will be, which makes the calculation difficult.  This operation needs to be done for both floating point and integers.

The floating point solution turned out to be fairly simple:

double average = 0d;
int count = 0;

// when a number comes in, do this:
average += (new_number - average) / ++count;

Here's how it works:  Realize that when a new number comes in, it is going to pull the average slightly higher or lower.  That difference is this calculation.  Since count is always growing this means that numbers arriving later have far less influence on the average unless they are fairly large (as it should be).

The integer solution looks different even though it is essentially the same thing.  What complicates the problem is that I want to avoid using floating point math if at all possible.  There are several different computations that would work, but here is the one that seemed to follow the true average as much as possible:

long average = 0, remainder = 0;
int count = 0;

// when a number comes in, do this:
count++;
average += (new_number + remainder) / count;
remainder = (new_number + remainder) % count;

Here is how it works:  since I cannot add incremental pieces of a whole number as they arrive (as I did in the floating point example), those incremental chunks are tracked in remainder.  Eventually, if the inputs are fairly uniform, enough delta builds up in remainder to add or subtract a number from the average, or it happens right away if the new number is significantly higher or lower than the average.

I wanted to verify correctness of the algorithm, so I wrote a simulator.  I found early on, the averages vary from the true average a bit, but this variance goes away as count increases.  To compensate for this, I keep track of the sum of all new_numbers and calculate average the normal way until either sum or count overcome predetermined thresholds.

1 comments:

Jonathan Ellis said...

Average, min, and max are a bit of a special case since you can update them without throwing away any data (subject to the accuracy of floating point math, at least).

What blew my mind is that you can actually calculate things like stdev as well. It's like cheating with statistics. Basically, you use a technique called reservoir sampling to keep a representative sample of your data in memory, updated constantly, and compute your metrics from that.

See https://github.com/codahale/metrics/blob/master/metrics-core/src/main/java/com/yammer/metrics/stats/UniformSample.java (and its usages).