## How to Lie with Statistics

Yesterday’s post on the Heritage Foundation’s inaccurate analysis of teen sex data touched on an important point.

Heritage used 0.10 as what’s called the *significance level*. Here’s what that’s all about.

At its simplest, a statistical test (a *hypothesis test*) is a comparison between a set of numbers — real-world measurements — and a theoretical distribution. The theoretical distribution rests on a set of well-defined assumptions, and its properties are well understood (think of the bell-shaped, or Gaussian, curve). If the measurements are too different from the theoretical distribution, we reject the hypothesis that the process we measured matches the assumptions of the theoretical distribution.

Some people work well with examples:

Say I want to know whether there are more males (**M** males) than females (**F** females) in a population of mice (of size **N**). I catch 50 mice (**n**=50).

If I caught every mouse (**N** = **n** = 50), I would know whether there are really more males than females just by counting, but assume that I didn’t catch them all.

So I count the males in the sample (**m**). I want to know if the male fraction of the population (**M/N**) is greater than 0.5 (half). All I know is the number of males in the sample (**m**); say it’s 31, giving a sample ratio of 0.62.

If we make some reasonable assumptions about my ability to catch a random sample of mice, theory tells me I can compare the ratio **m/n** to the ratio **M/N**, that the two should be pretty close, and how close will be determined by something called the *binomial distribution*. Don’t worry about the details.

The distribution in this case is a way of representing the probability that the data in the sample will be within a certain distance from the actual value in the population. Just as the fraction of heads that you get when you flip a fair coin will get closer to 50% the more times you flip, the more mice you catch, the closer the sample ratio will be to the real ratio.
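That convergence is easy to see in a quick simulation. The sketch below (not from the post’s data — a made-up 50:50 population, purely for illustration) draws samples of increasing size and watches the sample ratio settle toward the true ratio:

```python
import random

random.seed(1)  # reproducible illustration

TRUE_MALE_FRACTION = 0.5  # M/N in a hypothetical 50:50 population

# Catch samples of increasing size and watch the sample ratio m/n
# settle toward the true population ratio M/N.
for n in (10, 100, 1000, 10000):
    m = sum(random.random() < TRUE_MALE_FRACTION for _ in range(n))
    print(f"n = {n:5d}   m/n = {m / n:.3f}")
```

With a sample of 50, the ratio can still wander a fair distance from 0.5 — which is exactly why we need a test rather than a glance.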

In statistics, you want to falsify your hypothesis. You set up a dummy hypothesis, hoping to knock it down. In this case, we want to know if there are more males than females, so our *null hypothesis* is that the fraction of males in the population is less than or equal to 50%. If we reject that hypothesis, we can say that there must be more males than females. Tricky, huh?

So we do some math and find that the probability of drawing a sample with 31 or more males from a population that is truly 50:50 is 0.0595, or 5.95%.

So what? What does 5.95% mean to us? It means that 5.95% of the time, a sample from a 50:50 population would give us a sample as skewed or more skewed than this one.
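That figure is easy to check yourself. Assuming the binomial model described above, a short sketch using only Python’s standard library adds up the tail of the Binomial(50, 0.5) distribution — the chance of a sample at least as skewed as the one observed:

```python
from math import comb

n = 50  # sample size

# Under the null hypothesis (true ratio 50:50), each caught mouse is
# male with probability 0.5, so the male count follows Binomial(50, 0.5).
def tail_probability(m: int) -> float:
    """P(m or more males out of n) under a fair 50:50 population."""
    return sum(comb(n, k) for k in range(m, n + 1)) / 2**n

print(f"31 or more males: {tail_probability(31):.4f}")  # 0.0595
print(f"30 or more males: {tail_probability(30):.4f}")  # 0.1013
```

The second line, for 30 males, will matter again a little further down.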

Is that good or bad?

This is where you start following convention, rather than math. Traditionally, people use a *significance value* of 5% (0.05); that’s a magic number below which results are considered “*significant*.” One can choose other values; all that number indicates is how many mistakes you are willing to make. The smaller it is, the less often you will get a result which seems to falsify the null hypothesis even though that hypothesis is true (a *false positive*; statisticians call it Type I error).

People find that 0.05 is pretty fair. On average, one experiment in 20 will give results which seem significant but are actually not. It also tends to balance nicely against Type II error, the probability that you will fail to reject a false null hypothesis (a *false negative*).

Heritage decided on a lower standard of evidence than is normally used. By setting their significance level at 10%, they were able to claim differences between groups where an investigator using the conventional significance level would have found no significant effect.

In the mouse example above, the result would not be significant at the 5% level, but it would be at the 10% level. (That’s why I chose 31 males; at 30, the probability is 10.13%, which misses even the looser cutoff.)

If you were doing one analysis without much data, using a high significance level might be tolerable. But Heritage had huge swaths of data and ran lots of tests. By setting a high significance level (easy to clear) and running lots of tests, they guaranteed that they’d get some falsely significant results.

Why does it matter that they ran lots of tests? If they ran 10 unplanned comparisons where none showed any real effect, the probability of getting one or more falsely significant results would be 1 − 0.9^10. That’s 0.65, or 65%. Even at the 0.05 level (as most people use), they’d have a 40% chance of at least one error.
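Those figures come from the complement rule: if each of *k* independent tests uses significance level α and no real effects exist, the chance that at least one comes up falsely significant is 1 − (1 − α)^k. A minimal sketch:

```python
# Probability of at least one false positive across k independent tests,
# each run at significance level alpha, when no real effects exist.
# This is the complement of "every test correctly comes up non-significant".
def family_wise_error(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

print(f"{family_wise_error(0.10, 10):.2f}")  # 0.65
print(f"{family_wise_error(0.05, 10):.2f}")  # 0.40
```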

There are ways around all that, but you only use them if you’re being honest. One commenter referred to the Bonferroni correction, which is one common technique. All you do is divide the significance value by the number of unplanned comparisons. If you did ten comparisons at an experiment-wide 0.05 significance level, you’d only reject a null hypothesis if a test gave you a p-value of 0.005 (0.05/10) or less. Other, possibly superior, ways exist.
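Here’s a sketch of the Bonferroni correction in action, using ten made-up p-values (the numbers are hypothetical, purely for illustration):

```python
# Bonferroni correction: divide the experiment-wide significance level
# by the number of unplanned comparisons.
alpha, n_tests = 0.05, 10
cutoff = alpha / n_tests  # each individual test must clear 0.05 / 10 = 0.005

# Ten hypothetical p-values, made up purely for illustration.
p_values = [0.003, 0.020, 0.049, 0.110, 0.004, 0.300, 0.650, 0.048, 0.075, 0.009]

naive = [p for p in p_values if p <= alpha]       # six look "significant"
corrected = [p for p in p_values if p <= cutoff]  # only the strongest survive

print(f"significant at 0.05: {naive}")
print(f"after Bonferroni:    {corrected}")  # [0.003, 0.004]
```

Note how a naive 0.05 cutoff would declare six of the ten comparisons significant, while the corrected threshold keeps only the two strongest — the rest are exactly the kind of false positives the correction guards against.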

So, when reading about statistics, be sure to think about the significance level being used, and ask whether the investigators kept running tests until one came out the way they wanted. Honest investigators will explain what they did; dishonest ones will try to sneak it past you.

Now you can go back to reading about Michael Jackson.