Foolish Science

Science, pizza, apps, and mayhem.

Fixing Broken Statistics with the Median Absolute Deviation

Mad scientists often have to deal with measurements that are missing the all-important standard error. Having the standard error for each measurement can be really helpful if you're trying to pull something out of the noise, like a Higgs boson or a gravitational wave.

Let's suppose we want to learn just how precise our giant death ray is. We point it at Snooki's last known position, turn the power to "minor sunburn", and fire a couple of thousand times. Let's assume the reason the death ray doesn't hit the exact same spot every time is purely random scatter that follows a normal distribution; otherwise the standard error isn't really well defined. Take a look at a histogram of those measurements:

Here the standard error is essentially how far a measurement will typically be from the actual value (for a single measurement it is just the standard deviation of the scatter). More specifically, the square of the standard deviation is the average of the squared differences between a set of measurements and the actual value. In terms of our histogram, the standard error is the distance from the center within which about 68% of the measurements fall. This happens to be about 0.85 times the distance from the center of the histogram to the point where it drops to half its maximum height (the half width at half maximum, or HWHM).
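Both numbers are easy to check for an ideal normal distribution. This is a minimal sketch using only Python's standard library; the variable names are made up for illustration. It should print roughly 0.68 and 0.85.

```python
import math
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
unit = NormalDist(mu=0.0, sigma=1.0)

# Fraction of measurements within one standard deviation of the center.
frac_within_one_sigma = unit.cdf(1.0) - unit.cdf(-1.0)

# A Gaussian falls to half its peak height at x = sigma * sqrt(2 * ln 2),
# so that distance is the HWHM measured in units of sigma.
hwhm_in_sigmas = math.sqrt(2.0 * math.log(2.0))
sigma_in_hwhm = 1.0 / hwhm_in_sigmas

print(f"within one sigma: {frac_within_one_sigma:.4f}")
print(f"sigma / HWHM:     {sigma_in_hwhm:.4f}")
```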

So when we end up with a set of measurements like this, we can just calculate the standard deviation in the typical way (the braces indicate the mean and N is the number of measurements):
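Written out with angle brackets for the mean, and assuming the form that divides by N (the sample form divides by N - 1, which barely matters for a couple of thousand shots), the usual textbook definition is:

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \langle x \rangle \right)^2},
\qquad
\langle x \rangle = \frac{1}{N} \sum_{i=1}^{N} x_i
```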

For the data shown in this plot, the standard deviation is about 1.0.
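As a rough sketch of that calculation, the snippet below fakes the experiment with simulated shots. The seed, the shot count of 2000, and the true scatter of 1.0 are all assumptions chosen for illustration, not the data behind the plot.

```python
import math
import random

random.seed(42)  # hypothetical seed, only so the run is repeatable

# Assumption: 2000 shots whose landing positions scatter normally
# around the target (true center 0.0, true standard deviation 1.0).
N = 2000
shots = [random.gauss(0.0, 1.0) for _ in range(N)]

mean = sum(shots) / N
# Square root of the average squared difference from the mean.
std_dev = math.sqrt(sum((x - mean) ** 2 for x in shots) / N)

print(f"standard deviation: {std_dev:.2f}")
```

With this many shots the result should land within a few percent of 1.0.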

But, and this is really important, do we trust all of the measurements? What if our inept intern Vanessa kept accidentally bumping into the death ray during the test? Before we vaporize her, let's look at what that data might look like.

This really screws up the standard deviation, which is now about 1.6 instead of 1.0 (dammit, Vanessa!), off from the actual value by more than 60%.
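To see how little it takes, here is a hedged sketch with made-up contamination: 1800 good shots plus 200 bumped shots that scatter four times as widely. The real bump pattern isn't known, so these numbers are assumptions, picked because the expected result is sqrt(0.9 * 1 + 0.1 * 16), or about 1.6.

```python
import random
import statistics

random.seed(42)  # hypothetical seed, only so the run is repeatable

# Assumption: 1800 good shots (sigma = 1.0) plus 200 bumped shots that
# scatter four times as widely. The real bump pattern is not specified.
good_shots = [random.gauss(0.0, 1.0) for _ in range(1800)]
bumped_shots = [random.gauss(0.0, 4.0) for _ in range(200)]
all_shots = good_shots + bumped_shots

# pstdev divides by N, matching the "average squared difference" definition.
std_good = statistics.pstdev(good_shots)
std_all = statistics.pstdev(all_shots)

print(f"good shots only:   {std_good:.2f}")
print(f"with bumped shots: {std_all:.2f}")
```

Only one shot in ten is bad, yet the bad shots dominate the result because each deviation gets squared.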

If you suspect your data might have outliers, don't despair, but definitely don't calculate the standard deviation the normal way. Use the median absolute deviation instead.

The median absolute deviation is the median of the absolute differences between each point and the median of the data. Mathematically, it looks like:
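In symbols, writing x_i for the individual measurements (the notation is an assumption here), that definition reads:

```latex
\mathrm{MAD} = \mathrm{median}_i \left( \left| x_i - \mathrm{median}_j \left( x_j \right) \right| \right)
```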

As it turns out, for normally distributed data the median absolute deviation is related to the standard deviation by a constant scale factor:
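For a normal distribution the standard value of that factor is about 1.4826, so the standard deviation is roughly 1.4826 times the median absolute deviation. It comes from the fact that half of a normal distribution lies within about 0.6745 standard deviations of the center. A quick check with Python's standard library (variable names are illustrative):

```python
from statistics import NormalDist

# Half of a standard normal lies within +/- q of the center, where q is
# the 75th percentile (25% of the data sits beyond it on each side).
q = NormalDist().inv_cdf(0.75)

# For normal data MAD = q * sigma, so sigma = MAD / q.
mad_to_sigma = 1.0 / q

print(f"MAD / sigma: {q:.4f}")
print(f"sigma / MAD: {mad_to_sigma:.4f}")
```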

When we apply this to the data we get a standard deviation of 1.06, only a 6% difference from the actual value. Woohoo! One step closer to world domination.
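If you want to try this at home, here is a minimal sketch. The helper name `mad_sigma` and the simulated data (1800 good shots with a scatter of 1.0 plus 200 bumped shots with a scatter of 4.0) are assumptions for illustration, not the data from the plots, so the exact numbers will differ a little from those quoted above.

```python
import random
import statistics

def mad_sigma(values):
    """Estimate the standard deviation from the median absolute deviation.

    Assumes the uncontaminated data are normally distributed, which is
    where the 1.4826 scale factor comes from.
    """
    values = list(values)
    center = statistics.median(values)
    mad = statistics.median(abs(x - center) for x in values)
    return 1.4826 * mad

random.seed(42)  # hypothetical seed, only so the run is repeatable

# Assumption: 1800 good shots plus 200 bumped shots with 4x the scatter.
shots = ([random.gauss(0.0, 1.0) for _ in range(1800)]
         + [random.gauss(0.0, 4.0) for _ in range(200)])

print(f"ordinary standard deviation: {statistics.pstdev(shots):.2f}")
print(f"MAD-based estimate:          {mad_sigma(shots):.2f}")
```

The MAD-based estimate should stay close to the true scatter of 1.0 while the ordinary standard deviation gets dragged upward, because the median ignores how far out the bumped shots land and only cares that they sit in the outer half of the data.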