Science Sunday: Chocolate Caper

A few weeks ago, the internet lit up with stories that eating chocolate could help you lose weight. This week, the other shoe dropped. The story was bullshit:

I am Johannes Bohannon, Ph.D. Well, actually my name is John, and I’m a journalist. I do have a Ph.D., but it’s in the molecular biology of bacteria, not humans. The Institute of Diet and Health? That’s nothing more than a website.

Other than those fibs, the study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

The important thing to note here is that they did not fake their results. They used an analysis method that shows up in a lot of junk science studies in the arena of health:

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.

Whenever you hear that phrase, it means that some result has a small p value. The letter p seems to have totemic power, but it’s just a way to gauge the signal-to-noise ratio in the data. The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?

P(winning) = 1 – (1 – p)^n

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works.” Or they drop “outlier” data points.

You can see this p-hacking illustrated by XKCD here. A similar hack is sometimes referred to as the Texas Sharpshooter Fallacy. The idea is that if you run 100 tests, you will more likely than not find that at least one of them shows a signal that has only a 1% chance of being a coincidence. In fact, as Nate Silver pointed out in his book, if about one in a hundred of your tests doesn’t produce a spurious 99% result, you’re doing your statistics wrong.
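If you don't trust the arithmetic, you can simulate it. The sketch below is my own illustration, not anything from Bohannon or Silver. It leans on one standard fact: when there is no real effect, a p-value from a continuous test is equally likely to fall anywhere between 0 and 1. It also assumes the 100 tests are independent, which real measurements often are not.

```python
import random

def share_of_batches_with_a_fluke(n_tests=100, alpha=0.01, n_batches=10_000, seed=0):
    """Fraction of simulated studies in which at least one null test beats the alpha cutoff."""
    rng = random.Random(seed)
    flukes = 0
    for _ in range(n_batches):
        # With no real effect, each p-value is a uniform draw on [0, 1].
        if any(rng.random() < alpha for _ in range(n_tests)):
            flukes += 1
    return flukes / n_batches

print(share_of_batches_with_a_fluke())
```

The result should land near 1 – 0.99^100, or about 63%. Nearly two out of three batches of 100 tests cough up a "99% confidence" result with nothing behind it.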

One of the most infamous examples was a study in the early ’90s purporting to show that high-tension power lines caused leukemia. Their results were statistically significant. But they tested 800 medical conditions. They were bound to come up with something just by chance.

That’s not to say statistics are useless. It’s to say that they have a context. When you’re testing one specific hypothesis, such as whether vaccines cause autism, they are useful. But they can be very deceptive when used in this scattershot approach.

Another illustration is DNA testing. Police in many areas have been doing blind DNA searches of databases to identify suspects in cold cases. When they find their suspect, they claim that the likelihood of a false match is literally one in a million. But these databases have hundreds of thousands of names in them. If you had a specific suspect and other reasons to suspect him, that one in a million stat would mean something. But in a blind search, your odds of finding a match by sheer coincidence are more like one in three.
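The one-in-three figure falls out of the same lottery math. Here is a rough sketch, assuming each profile in the database is an independent one-in-a-million shot at a coincidental match. The database sizes are round numbers I picked to stand in for "hundreds of thousands," not figures from any real database.

```python
def chance_of_coincidental_match(database_size, false_match_rate=1e-6):
    """Chance that a blind search turns up at least one match purely by coincidence."""
    return 1 - (1 - false_match_rate) ** database_size

# Hypothetical database sizes, chosen only for illustration.
for size in (100_000, 400_000, 1_000_000):
    print(size, round(chance_of_coincidental_match(size), 2))
```

At around 400,000 profiles, the chance of a coincidental hit is roughly one in three, even though any single comparison is still a one-in-a-million shot.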

Bohannon uses the lottery illustration and it’s a perfect one. The odds of any particular person winning the lottery are something like one in tens of millions. But someone is going to beat those odds. Someone always does.

Science, particularly when it comes to health, is littered with these sorts of studies: blind searches that find something that then gets touted in the media. Vox illustrates it here (point #2). There are statistically significant studies showing both that milk causes cancer and that it prevents cancer. When you take them all into account, the net risk is basically zero. Of course, Vox is in a bit of a glass house, having frequently touted such studies when convenient.