The above image is a screenshot of an article from io9.com. Yes, the title is completely accurate. John Bohannon writes,
My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.
At first glance, Bohannon’s assertion may seem very strange to the non-scientist.
I know what you’re thinking. The study did show accelerated weight loss in the chocolate group—shouldn’t we trust it? Isn’t that how science works?
That’s certainly how modern education has taught us to think. The problem is, you can’t trust the results of a study if you only know the results. You need to be able to see the process. We might call this the Weasley Principle, following the words of J. K. Rowling’s character Arthur Weasley: “Never trust anything that can think for itself if you can’t see where it keeps its brain!” It’s quite easy to get whatever result you’re hoping to get if you let your results influence your process. Bohannon explains:
Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.
Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.
Whenever you hear that phrase, it means that some result has a small p value. The letter p seems to have totemic power, but it’s just a way to gauge the signal-to-noise ratio in the data. The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?
P(winning) = 1 – (1 – p)ⁿ
With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.
It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works.” Or they drop “outlier” data points.
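Both of Bohannon’s points — measuring many things, and rerunning an experiment until it “works” — are easy to check numerically. Here is a minimal Python sketch (the function names are mine; it relies on the fact that under the null hypothesis, p-values are uniformly distributed on [0, 1]):

```python
import random

# The "lottery ticket" arithmetic: the chance that at least one of n
# independent tests comes up significant at level p, even with no real effect.
def chance_of_false_positive(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Bohannon's 18 measurements at the conventional 0.05 cutoff.
print(f"{chance_of_false_positive(0.05, 18):.0%}")  # roughly 60%

# "Repeat the experiment until it works": under the null hypothesis,
# p-values are uniform on [0, 1], so each rerun is a 5% lottery ticket.
def attempts_until_significant(alpha: float = 0.05) -> int:
    attempts = 1
    while random.random() >= alpha:
        attempts += 1
    return attempts

random.seed(0)
trials = [attempts_until_significant() for _ in range(10_000)]
# On average, a determined researcher "succeeds" after about 1/alpha = 20 reruns.
print(sum(trials) / len(trials))
```

The first calculation reproduces Bohannon’s 60% figure exactly; the second shows that persistence alone guarantees a “significant” result eventually, with no real effect anywhere in sight.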
The problem is exacerbated by the fact that getting a truly significant result is far, far harder than it seems at first glance. John Ioannidis, a professor at Stanford’s medical school, made this case in a paper bluntly titled “Why Most Published Research Findings Are False.” Software engineer William Wilson lays out Ioannidis’ argument:
Suppose that there are a hundred and one stones in a certain field. One of them has a diamond inside it, and, luckily, you have a diamond-detecting device that advertises 99 percent accuracy. After an hour or so of moving the device around, examining each stone in turn, suddenly alarms flash and sirens wail while the device is pointed at a promising-looking stone. What is the probability that the stone contains a diamond?
Most would say that if the device advertises 99 percent accuracy, then there is a 99 percent chance that the device is correctly discerning a diamond, and a 1 percent chance that it has given a false positive reading. But consider: Of the one hundred and one stones in the field, only one is truly a diamond. Granted, our machine has a very high probability of correctly declaring it to be a diamond. But there are many more diamond-free stones, and while the machine only has a 1 percent chance of falsely declaring each of them to be a diamond, there are a hundred of them. So if we were to wave the detector over every stone in the field, it would, on average, sound twice—once for the real diamond, and once when a false reading was triggered by a stone. If we know only that the alarm has sounded, these two possibilities are roughly equally probable, giving us an approximately 50 percent chance that the stone really contains a diamond.
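Wilson’s back-of-the-envelope figure falls directly out of Bayes’ rule. A minimal Python sketch (the function name is mine):

```python
# Bayes' rule for the diamond detector: the probability the stone holds
# a diamond given that the alarm sounded.
#   P(D | A) = P(A | D) * P(D) / P(A)
def posterior(prior: float, sensitivity: float, false_alarm_rate: float) -> float:
    p_alarm = sensitivity * prior + false_alarm_rate * (1 - prior)
    return sensitivity * prior / p_alarm

# One diamond among 101 stones; a "99 percent accurate" detector that
# rings on the diamond 99% of the time and false-alarms 1% of the time.
print(f"{posterior(1 / 101, 0.99, 0.01):.1%}")  # roughly 50%
```

The exact value is 0.99 / 1.99 ≈ 49.7% — the detector’s impressive-sounding accuracy is nearly cancelled by the sheer number of diamond-free stones.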
This is a simplified version of the argument that Ioannidis applies to the process of science itself. The stones in the field are the set of all possible testable hypotheses, the diamond is a hypothesized connection or effect that happens to be true, and the diamond-detecting device is the scientific method. A tremendous amount depends on the proportion of possible hypotheses which turn out to be true, and on the accuracy with which an experiment can discern truth from falsehood. Ioannidis shows that for a wide variety of scientific settings and fields, the values of these two parameters are not at all favorable.
For instance, consider a team of molecular biologists investigating whether a mutation in one of the countless thousands of human genes is linked to an increased risk of Alzheimer’s. The probability of a randomly selected mutation in a randomly selected gene having precisely that effect is quite low, so just as with the stones in the field, a positive finding is more likely than not to be spurious—unless the experiment is unbelievably successful at sorting the wheat from the chaff. Indeed, Ioannidis finds that in many cases, approaching even 50 percent true positives requires unimaginable accuracy. Hence the eye-catching title of his paper.
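The same arithmetic gives what Ioannidis calls the positive predictive value: the share of positive findings that are actually true. The numbers below are hypothetical, chosen only to illustrate a low-prior search like the gene hunt:

```python
# Positive predictive value of a single positive finding, given the prior
# probability that the hypothesis is true, the study's power (its
# true-positive rate), and its significance threshold alpha (its
# false-positive rate).
def ppv(prior: float, power: float, alpha: float) -> float:
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical gene hunt: a 1-in-1,000 prior, 80% power, alpha = 0.05.
# Under these assumptions, fewer than 2% of positive findings are real.
print(f"{ppv(0.001, 0.80, 0.05):.1%}")
```

As with the stones in the field, a low prior swamps even a well-powered experiment: most of what the alarm flags is chaff.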
That’s bad, but it gets worse:
What about accuracy? Here, too, the news is not good. First, it is a de facto standard in many fields to use one in twenty as an acceptable cutoff for the rate of false positives. To the naive ear, that may sound promising: Surely it means that just 5 percent of scientific studies report a false positive? But this is precisely the same mistake as thinking that a stone has a 99 percent chance of containing a diamond just because the detector has sounded. What it really means is that for each of the countless false hypotheses that are contemplated by researchers, we accept a 5 percent chance that it will be falsely counted as true—a decision with a considerably more deleterious effect on the proportion of correct studies.
None of the difficulties Ioannidis identifies are fatal by themselves, or even in combination. When it comes to the proper work of science, all they really mean is that science is hard. This is not news. The real problem comes when the human element is introduced. Once we think not just about the proper work of science but about science as done by sinful human beings, these difficulties mean that science is hard in ways that scientists can manipulate to their own ends. This is the lesson of Bohannon’s hoax paper.
This is not to say that all scientists, or even most scientists, are unscrupulous. It is a warning, however. The wisdom commonly attributed to Sir Francis Bacon, who articulated the scientific method, applies as much to scientists as to anyone else: “Men prefer to believe what they prefer to be true.” As Wilson points out,
Scientists have long been aware of something euphemistically called the “experimenter effect”: the curious fact that when a phenomenon is investigated by a researcher who happens to believe in the phenomenon, it is far more likely to be detected. . . . Even those with the best of intentions have been caught fudging measurements, or making small errors in rounding or in statistical analysis that happen to give a more favorable result. Very often, this is just the result of an honest statistical error that leads to a desirable outcome, and therefore it isn’t checked as deliberately as it might have been had it pointed in the opposite direction.
Confirmation bias is a powerful influence. So too is the desire for a result that will lead to a successful Ph.D. defense and/or a published paper. These influences corrupt science. To work properly, the scientific method depends on scientists actively and consistently working to disprove their hypotheses and theories, so that those which are accepted have been tested as rigorously as possible. This requires constant vigilance against confirmation bias. In today’s culture, it also requires scientists to actively work against what’s good for their careers. If you look for an effect and you find it, you’ve made a discovery; that’s news, and therefore publishable. The more surprising or controversial your discovery, the better. If you look for an effect and don’t find it, you have discovered nothing and have no news to report. Your only message is that things aren’t different than we thought, after all. The industry that has grown up around the practice of science strongly discourages skepticism and rigorous challenging of one’s own ideas.
Again, none of this is to denigrate science or scientists at all. We simply need to remember that science is a human activity conducted by normal human beings who are, as a class, representative of the full moral range of human character. Wilson tells us,
In a survey of two thousand research psychologists conducted in 2011, over half of those surveyed admitted outright to selectively reporting those experiments which gave the result they were after. Then the investigators asked respondents anonymously to estimate how many of their fellow scientists had engaged in fraudulent behavior, and promised them that the more accurate their guesses, the larger a contribution would be made to the charity of their choice. Through several rounds of anonymous guessing, refined using the number of scientists who would admit their own fraud and other indirect measurements, the investigators concluded that around 10 percent of research psychologists have engaged in outright falsification of data, and more than half have engaged in less brazen but still fraudulent behavior such as reporting that a result was statistically significant when it was not, or deciding between two different data analysis techniques after looking at the results of each and choosing the more favorable.
Scientists are neither angels nor saints, and science is not some Thomistic exercise of unfallen human reason. Any criticism which can be levied against “organized religion” can be levied just as well against the scientific establishment. As Wilson argues, the comparison is instructive.
Like monasticism, science is an enterprise with a superhuman aim whose achievement is forever beyond the capacities of the flawed humans who aspire toward it. The best scientists know that they must practice a sort of mortification of the ego and cultivate a dispassion that allows them to report their findings, even when those findings might mean the dashing of hopes, the drying up of financial resources, and the loss of professional prestige. It should be no surprise that even after outgrowing the monasteries, the practice of science has attracted souls driven to seek the truth regardless of personal cost and despite, for most of its history, a distinct lack of financial or status reward. Now, however, science and especially science bureaucracy is a career, and one amenable to social climbing. Careers attract careerists, in [Paul] Feyerabend’s words: “devoid of ideas, full of fear, intent on producing some paltry result so that they can add to the flood of inane papers that now constitutes ‘scientific progress’ in many areas.”
Modern scientific materialism has tried to use science as a replacement for God and religion, and thus “to treat the body of scientific knowledge as a holy book or an a-religious revelation that offers simple and decisive resolutions to deep questions.” It cannot work, and we’re seeing the ill consequences. Science cannot fill the place of God; not even the whole of creation, even if we could perfectly understand all its workings, could substitute for the Creator. The effort to exchange the two isn’t just bad theology, it’s bad science. Wilson’s concluding warning is apt:
When cultural trends attempt to render science a sort of religion-less clericalism, scientists are apt to forget that they are made of the same crooked timber as the rest of humanity and will necessarily imperil the work that they do. The greatest friends of the Cult of Science are the worst enemies of science’s actual practice.