Fun with Statistics Calculations

Categories: C#

Tags: Algorithms

A while back I was working on a system where we were going to score work items to measure their audit risk.  Work items with higher scores would most likely be audited, while those with lower scores would pass.  The exact mechanism of measuring the risk is immaterial for this post, so we’ll treat it as a black box number.  Furthermore, we calculate the risk on all work items but only update our statistics (as described below) on work items that actually did get audited.

I wanted to know whether the audit score for a particular work item was far above or far below the mean.  If the score was low, the audit risk should be low, and vice versa.  What we are looking for here is a “sigma” level: a number that indicates how many standard deviations a score is from the mean.  A sigma level of zero means the score equals the mean.  A sigma level of 1 means it is one standard deviation above the mean, and -1 means it is one standard deviation below the mean.  Lower sigma levels are generally better than higher ones in this system.  If the scores are roughly normally distributed, we’d expect about 68% of the work items (just over two-thirds) to score within +/- 1 sigma.  A sigma level of 6 or higher means the score is a very large outlier.

To calculate this sigma value, we need two pieces of data: the mean and the standard deviation of the population or sample (i.e. the audit risk scores).  I did not want to recalculate these values over the entire data set each time I wanted to compute a sigma level.  Instead, I wanted to fold each new score into the previously stored mean and standard deviation, which makes the calculation really fast.

Let’s start with the mean.  If we save the number of data points used (n) and the previous mean calculated (ca), we can derive the new mean given a new data point (x) with the following formula:

new_mean = (x + n * ca) / (n + 1)

Or in C#:

public static double CalculateNewCumulativeAverage(int n, int x, double ca)
{
    return (x + n * ca) / (n + 1);
}
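
As a quick sanity check, the sketch below feeds a few made-up scores through the same formula one at a time and compares the result with an average computed over the whole array.  The MeanCheck class, its IncrementalMean and Demo methods, and the sample scores are my own illustration, not part of the original system.

```csharp
using System;
using System.Linq;

public static class MeanCheck
{
    // Same formula as above: fold one new point x into the mean of n points.
    public static double CalculateNewCumulativeAverage(int n, int x, double ca)
    {
        return (x + n * ca) / (n + 1);
    }

    // Runs every score through the incremental formula, starting from an empty set.
    public static double IncrementalMean(int[] scores)
    {
        double ca = 0.0;
        for (int n = 0; n < scores.Length; n++)
            ca = CalculateNewCumulativeAverage(n, scores[n], ca);
        return ca;
    }

    public static void Demo()
    {
        int[] scores = { 2, 4, 4, 4, 5, 5, 7, 9 };   // hypothetical risk scores
        Console.WriteLine(IncrementalMean(scores));  // incremental result
        Console.WriteLine(scores.Average());         // full recomputation
    }
}
```

Both values should agree up to floating-point rounding (the mean of these scores is 5).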

 

The standard deviation calculation is a little harder.  The Wikipedia article at http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods describes a method for rapid calculation that requires you to only provide the following variables to compute a new standard deviation given a new data point (x): n – the number of previous data points used, s1 – sum of all previous x’s, and s2 – sum of all previous x^2 (squared).  Because the code below divides by (n + 1) * n, it computes the sample standard deviation rather than the population one.  Here’s the formula in C#:

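public static double CalculateNewStandardDeviation(int n, int x, int s1, long s2)
{
    // With no previous points there is only one value in total, so the
    // standard deviation is undefined.
    if (n == 0)
        return double.NaN;
    // Fold the new point into local copies of the running sums; the long
    // and double casts keep the intermediate products from overflowing.
    long sum = (long)s1 + x;
    s2 += (long)x * x;
    double count = n + 1;
    double num = count * s2 - (double)sum * sum;
    double denom = count * n;
    return Math.Sqrt(num / denom);
}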

 

This is a very fast way of calculating the standard deviation because you never have to go back over the existing data points (which also means not reading values out of a database).
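
To see that the running sums give the same answer as a full pass over the data, here is a small sketch that compares the two on made-up scores.  The StdDevCheck class, its method names, and the sample values are hypothetical, and FromSums takes the total count (including the newest point) rather than the previous count.

```csharp
using System;
using System.Linq;

public static class StdDevCheck
{
    // Sample standard deviation from running totals only:
    // count = number of points, s1 = sum of x, s2 = sum of x squared.
    public static double FromSums(int count, long s1, long s2)
    {
        if (count < 2)
            return double.NaN; // undefined for fewer than two points
        double num = (double)count * s2 - (double)s1 * s1;
        double denom = (double)count * (count - 1);
        return Math.Sqrt(num / denom);
    }

    // Reference: the textbook two-pass calculation over all points.
    public static double TwoPass(int[] xs)
    {
        double mean = xs.Average();
        double sumSq = xs.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSq / (xs.Length - 1));
    }

    public static void Demo()
    {
        int[] scores = { 2, 4, 4, 4, 5, 5, 7, 9 }; // hypothetical risk scores
        long s1 = 0, s2 = 0;
        foreach (int x in scores)
        {
            s1 += x;
            s2 += (long)x * x;
        }
        // Here s1 = 40 and s2 = 232; both lines print about 2.138.
        Console.WriteLine(FromSums(scores.Length, s1, s2));
        Console.WriteLine(TwoPass(scores));
    }
}
```

For these eight scores the sum of squared deviations from the mean is 32, so both methods come out at the square root of 32/7.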

The sigma value I talked about earlier can then be calculated given the data point (x), cumulative mean (ca) and standard deviation (s):

public static double CalculateSigma(int x, double ca, double s)
{
    return (x - ca) / s;
}
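
One edge case worth noting: when the standard deviation is NaN (fewer than two data points) or zero (every score identical), the division above yields NaN or infinity.  Here is a minimal sketch of a guarded variant.  The name CalculateSigmaOrZero and the choice to report 0 in those cases are my assumptions; your system may prefer a different fallback.

```csharp
using System;

public static class SigmaExample
{
    // Returns how many standard deviations x lies from the mean ca.
    // Assumption: when s is NaN or zero there is no meaningful spread yet,
    // so the score is reported as sitting at the mean (sigma 0).
    public static double CalculateSigmaOrZero(int x, double ca, double s)
    {
        if (double.IsNaN(s) || s == 0.0)
            return 0.0;
        return (x - ca) / s;
    }
}
```

For example, a score of 9 against a mean of 5 and a standard deviation of 2 gives a sigma level of 2.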

 

So all you need to store in your database are the following scalar values to calculate these stats:

  1. Total number of data points (n).
  2. Sum of all data points (s1).
  3. Sum of the squares of all data points (s2).
  4. Cumulative average or mean (ca).
  5. Current standard deviation (s).

To add a new data point (x) and update all variables to new values:

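public static void AddDataPoint(int x, ref int n, ref int s1, ref long s2, ref double ca, ref double s)
{
    // Compute the new statistics from the old totals first...
    ca = CalculateNewCumulativeAverage(n, x, ca);
    s = CalculateNewStandardDeviation(n, x, s1, s2);
    // ...then fold the new point into the stored totals.
    n += 1;
    s1 += x;
    s2 += (long)x * x; // long arithmetic: the square of an int can overflow an int
}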
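
Putting the pieces together, here is a rough sketch of how the stored values might be wrapped in a small class instead of being passed around by ref.  The RiskStats and RiskStatsDemo names, the property names, and the sample scores are all hypothetical; the arithmetic is the same running-sum approach described above.

```csharp
using System;

// Hypothetical wrapper around the five stored values.
public sealed class RiskStats
{
    public int N { get; private set; }        // total number of data points
    public long S1 { get; private set; }      // sum of all data points
    public long S2 { get; private set; }      // sum of the squares
    public double Mean { get; private set; }  // cumulative average
    public double StdDev { get; private set; } = double.NaN;

    // Folds one audited score into the running totals.
    public void Add(int x)
    {
        Mean = (x + N * Mean) / (N + 1);
        N += 1;
        S1 += x;
        S2 += (long)x * x;
        // Sample standard deviation; undefined until there are two points.
        StdDev = N < 2
            ? double.NaN
            : Math.Sqrt(((double)N * S2 - (double)S1 * S1) / ((double)N * (N - 1)));
    }

    // Sigma level of a score against the current statistics.
    public double Sigma(int x)
    {
        return (x - Mean) / StdDev;
    }
}

public static class RiskStatsDemo
{
    public static double Run()
    {
        var stats = new RiskStats();
        foreach (int score in new[] { 2, 4, 4, 4, 5, 5, 7, 9 }) // made-up scores
            stats.Add(score);
        // Mean is 5 and the sample standard deviation is about 2.138,
        // so a score of 9 sits roughly 1.87 sigma above the mean.
        return stats.Sigma(9);
    }
}
```

Only the five properties would need to be persisted between calls, matching the list of stored values above.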
