Everything You Know about the P-Value is Wrong
Editor’s Note: This week, I’m excited to introduce two new tl;dr bloggers extraordinaire!
Jason Shumake works as a data scientist at UT Austin’s Institute for Mental Health Research, where he builds statistical models to predict mental health outcomes. He also develops R packages that make it easier for researchers to “practice safe stats”. Check out a couple of his recent publications:
Eimeira Padilla has worked as the investigational drug pharmacist for the Ascension - Texas Family of Hospitals since 2014. In this role, she collaborates with other clinical members in the facilitation of clinical drug research. She is also a Pharmacist Faculty/Preceptor, co-investigator, and mentor on several studies, including multiple pharmacy residency projects, and she has served on the institution’s ethical research review board.
Sounds like they’re the ones we all want to share an office with when completing research projects, am I right? Well, since THAT would be an overflowing office, let’s at least count our lucky stars that we have their knowledge laid out here on tl;dr. I don’t think there’s a better or more down-to-earth overview of the p-value anywhere, so happy stats reading!
(Whoa, I just used ‘happy’ and ‘stats’ in the same sentence. But seriously, happy stats reading!)
EVERYTHING YOU KNOW ABOUT THE P-VALUE IS WRONG
I bet you didn’t know and might be disturbed to learn that most conclusions from published biomedical research are false!
For example, in the field of drug discovery, one report found that fewer than 25% of published findings could be reproduced. This has been largely blamed on a “publish or perish” culture, which offers perverse incentives to report “significant” (p < 0.05) results instead of rewarding the methodical pursuit of truth.
We can’t change research culture with a single blog post. But what we can do is raise your awareness about some bad statistical practices that also contribute to these dire replication rates. We’re going to begin with the mother of all culprits: abusing the p-value.
You may be asking yourself, how do we abuse said p-value? Well, we’re so glad you asked!
WHAT IS A P-VALUE?
Before we dive into p-value abuse and ways to avoid it, let’s review the definition of a p-value.
A p-value is the probability of obtaining a result as or more extreme than the one observed, **given that the null hypothesis is true**.
Let’s take a closer look at the text in bold because it’s an important part of the definition that’s easy to misunderstand—or forget about entirely.
First, notice the phrase “given that”. This tells us that the probability is conditional on an assumption. The assumption is that the null hypothesis is TRUE.
And what is the null hypothesis? Think of it as the default position that someone is trying to disprove. Usually a statement along the lines of “there’s nothing to see here”, like:
There’s no difference between those groups.
There’s no relationship between those variables.
That drug has no therapeutic effect.
But the null hypothesis doesn’t have to be a negative statement. For example, if someone is trying to demonstrate that Drug A is not inferior to Drug B, then the null hypothesis would be that Drug A is inferior to Drug B. However, to keep things simple, let’s think about a specific, typical example of a null hypothesis - that a drug has no therapeutic effect.
By definition, the p-value applies to “Null-Hypothesis World”—a hypothetical world in which the drug we’re testing has no effect. Like definitely, for sure, no effect. This leads us to the first common misinterpretation of the p-value: A p-value is NOT the probability that the null hypothesis is true.
For example, if we test the effect of our drug against a placebo and obtain p = 0.01, does that mean there is only a 1% chance that our drug is ineffective (and therefore a 99% chance that it is effective)? No, it does not!
Why not? Remember “given that the null hypothesis is true”? The p-value cannot tell you the probability of the very thing it assumes to be certain! Here’s the key point:
The p-value is not the likelihood of the null hypothesis given your data; it’s the likelihood of your data given the null hypothesis.
Now you might expect that those two likelihoods are related, and they are. But they are not the same, and it’s not straightforward to derive one from the other. If it’s not obvious why, think about this:
What’s the probability that the ground is wet given the hypothesis it’s raining? Very high, right?
Now, what about the probability that it’s raining given the ground is wet? Not quite the same, huh?
No doubt it’s higher than it’d be if the ground were dry, but there are other reasons besides rain for why the ground might be wet. The same goes for your hypothesis—its likelihood depends on a lot more than one puny data set!
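To make that asymmetry concrete, here is a quick Bayes’-rule calculation (every number below is invented purely for illustration):

```python
# Made-up numbers, purely for illustration
p_rain = 0.10            # prior: it rains on 10% of days
p_wet_given_rain = 0.95  # rain almost always wets the ground
p_wet_given_dry = 0.20   # sprinklers, dew, spilled drinks...

# Bayes' rule: P(rain | wet) = P(wet | rain) * P(rain) / P(wet)
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 2))  # 0.35
```

So P(wet | rain) is 0.95, but P(rain | wet) comes out to only about 0.35, because wet ground has plenty of other explanations.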
There is a branch of statistics, called Bayesian statistics, that is formally concerned with quantifying the likelihood of a hypothesis in light of prior beliefs and new data. But the traditional approach, called frequentist statistics, takes a more qualitative approach. The logic goes something like this: if the data are super unlikely to occur in Null-Hypothesis World, then infer the data do not come from that world, i.e., the null hypothesis is wrong.
By convention, p < 0.05 has been the typical criterion for “super unlikely if the null hypothesis is true”. But it’s important to understand that this convention—by which we leap from a p-value to a decision about the merits of a hypothesis—is indirect, arbitrary, and subjective.
We’re like 2 degrees removed from actually knowing that our drug is effective, which is why statisticians use seemingly cumbersome language like “fail to reject the null”. It’s a linguistic reminder that the p-value applies to a particular data set and its compatibility with the null hypothesis. If you collect another data set testing the same hypothesis, you’ll almost certainly get a different p-value—oftentimes a very different one.
We’ll come back to this point in a bit, but for now we want to draw your attention to a very important word in the definition of the p-value, and that word is “the”! As in “given the null hypothesis is true”, as in a single hypothesis test. Not a hundred hypothesis tests. Not ten hypothesis tests. One hypothesis test.
And we also need to point out something that is not usually stated explicitly but is super important: this single hypothesis test must be specified prior to running an experiment. In the modern world, researchers often collect multiple variables and perform multiple statistical tests in the absence of a prespecified hypothesis, or they formulate hypotheses after looking at the data they’ve collected. As we will explain in the future, these practices lead to p-values that are horribly biased.
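Here is a quick sketch (our own illustration, not from the original studies) of why unplanned multiple testing is so dangerous. Under a true null hypothesis, a p-value is uniformly distributed, so each extra test is another 5% lottery ticket for a false positive:

```python
import random

random.seed(0)
n_experiments = 2000    # virtual research projects
tests_per_project = 20  # unplanned hypothesis tests per project

# Under a true null, a p-value is uniform on [0, 1], so each test has
# a 5% chance of landing below 0.05 purely by luck.
false_positive_projects = sum(
    any(random.random() < 0.05 for _ in range(tests_per_project))
    for _ in range(n_experiments)
)
rate = false_positive_projects / n_experiments
print(rate)  # close to 1 - 0.95**20, i.e., about 0.64
```

In other words, a researcher who runs 20 unplanned tests and then reports “the significant one” will find something about two times out of three even when there is nothing to find.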
THE UNLICENSED PRACTICE OF “FOLK STATISTICS”
Based on the definition of the p-value, we cannot make any precise claims relating the magnitude of the p-value and the likelihood that a hypothesis is correct. Yes, in a very rough sense, smaller p-values could signal a more implausible null hypothesis. But there is no statistical basis for assuming, for example, that a p-value of 0.001 means the null hypothesis is 10 times less likely than if the p-value were 0.01. Unfortunately, this is exactly how many researchers interpret p-values.
The table below (largely copied from Geoff Cumming’s excellent YouTube video Dance of the P Values) summarizes this “folk wisdom”, which no sane statistician would ever endorse but which nonetheless gets passed down from generation to generation of biomedical researchers like a cherished family recipe.
Somewhere along the line someone made the following blunder: if p < 0.05 is “significant”, then p < 0.01 must be “highly significant”, and p < 0.001 must be very highly significant indeed! And if p-values don’t at least approach 0.05, according to this folk tale, we might as well take this as evidence for the null hypothesis and conclude that our scientific hypothesis is wrong.
These rules of thumb, despite having no evidentiary basis, infected the whole research establishment, which actually began to reward researchers for obtaining small p-values! (Publications in more prestigious journals, more grant funding, more honors and awards, etc., etc.) Consequently, new researchers quickly develop conditioned emotional responses to the p-values they see when they run a statistical analysis. These reactions are not unlike these scenes from Seinfeld:
| p-value | Folk label | Folk inference |
|---|---|---|
| p < 0.001 | Very highly significant!!! | There IS an effect. Definitely for sure! |
| p < 0.01 | Highly significant!!! | There is an effect. |
| p < 0.05 | Significant (whew!) | There is an effect (most likely). |
| 0.05 ≤ p < 0.10 | Approaching significance | Probably an effect, but maybe not? |
| p ≥ 0.10 | Not significant | No effect (effect is zero?). |
But are these inferences and emotions justified? For this way of thinking about p-values to be rational, the following things ought to be true:
1. When there definitely is an effect, p-values should be less than 0.001 most of the time and less than 0.01 almost all the time; p-values larger than 0.10 should be exceedingly rare. (Why else would scientists feel comfortable concluding that p < 0.001 means there definitely is an effect and p > 0.10 means there is no effect?)
2. A result with p < 0.001 should provide a more accurate estimate of the true effect than a result with p = 0.01 or p = 0.05. (Why else would truth-seeking scientists celebrate such a result?)
3. If we obtain a particular p-value, say 0.01, and we repeat the exact same experiment (same population, same sample size, same methods) on a different sample, we should obtain a p-value of similar magnitude as the original, say between 0.005 and 0.05. In other words, the p-value should be very reliable across replications. (Why else would objective scientists get so intellectually and emotionally invested in a single statistic?)
So how can we know if these three expectations are reasonable? Well, here’s the problem - we don’t regularly encounter evidence in our day-to-day experiences as scientific investigators that would challenge any of these beliefs.
For Points 1 and 2, when do we ever know in the real world what the true effect is? (If we already knew the truth, we wouldn’t need to run an experiment!) So how can we possibly evaluate whether these rules of thumb capture the truth?
And for Point 3, although replication is one of the most important principles of the scientific method, full and exact replications are surprisingly uncommon. That’s because funding agencies and research journals favor innovative studies and novel findings. Replication gets a lot of lip service, but—truth be told—if a grant proposal aimed only to copy previously published work, it would be dead-on-arrival.
What is more common is the partial replication, in which some elements of a previous experimental design are retained but with some parameters modified or new ones introduced. Consequently, if these studies end up contradicting the findings of a previous study, the discrepancies will usually be attributed to “methodological differences” rather than to the unreliability of the original findings. In this way, exaggerated or false findings could persist for decades before losing credibility.
So what are we to do? Just resign ourselves to ignorance and hope this folk wisdom about interpreting p-values is right?
COMPUTER SIMULATIONS TO THE RESCUE!
While it would be practically impossible in real life to repeat an experiment 100 times under the exact same conditions—and completely impossible to know the true magnitude of an effect beforehand—we don’t need real-world experiments to test our assumptions about p-values. With just a bit of computer code, we are granted god-like powers to create a virtual reality in which we get to define the truth—and create an army of virtual scientists who try to discover that truth!
You are about to see the results of one such simulation, in which we made 100 virtual scientists conduct a randomized clinical trial to test the efficacy of this fake drug from The Onion: “Made by Pfizer, Despondex is the first drug designed to treat the symptoms of excessive perkiness.”
Here is the R code that builds our virtual world. (Don’t worry if you don’t know R. We’re about to explain what this does in plain English. This is just to show you that it only takes a few lines of code to do this!)
First, we define what will happen when one of our virtual scientists conducts a clinical trial on Despondex (the `simulate_trial` function).
The Despondex treatment group is made up of 63 patients, randomly drawn from a population whose “perk” scores are normally distributed, with a mean of 0 and a standard deviation of 1.
The placebo control group is also made up of 63 patients randomly drawn from a population whose perk scores are normally distributed with a standard deviation of 1—but a different mean of 0.5.
Thus, we have created a virtual reality in which Despondex reduces perkiness by half a standard deviation. To give some real-world examples of drugs that improve patients by about 0.5 standard deviations, antihypertensives for hypertension and corticosteroids for asthma have an effect size of 0.56, and antipsychotics for schizophrenia have an effect size of 0.51.
Finally, with the `replicate` function, we have 100 different virtual scientists run this experiment and use a t-test to evaluate the null hypothesis (that there is no difference between Despondex and placebo).
To summarize, we have 100 independent scientists run the exact same experiment, drawing from the exact same populations, so any differences in the p-values they obtain can only be attributed to random sampling error.
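If you’d rather play along in Python, here is a rough equivalent of the simulation just described (this is not the authors’ R code; the `simulate_trial` name follows the text, and the t-test p-value uses a normal approximation, which is very close at df = 124):

```python
import random
from statistics import NormalDist, mean, variance

def simulate_trial(n=63, true_diff=0.5):
    """One virtual clinical trial: returns the two-sample t-test p-value."""
    despondex = [random.gauss(0.0, 1.0) for _ in range(n)]     # treated: mean perk 0
    placebo = [random.gauss(true_diff, 1.0) for _ in range(n)]  # control: mean perk 0.5
    diff = mean(placebo) - mean(despondex)
    se = ((variance(despondex) + variance(placebo)) / n) ** 0.5
    # Normal approximation to the t distribution (df = 124, so very close)
    return 2 * (1 - NormalDist().cdf(abs(diff) / se))

random.seed(42)
pvals = [simulate_trial() for _ in range(1000)]
power = sum(p < 0.05 for p in pvals) / len(pvals)
print(power)  # about 0.80
```

(We run 1,000 virtual scientists here instead of 100 just to make the proportions more stable.)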
So how many of our virtual scientists do you think will obtain p < 0.001?
How many will get a p-value between 0.001 and 0.01, or between 0.01 and 0.05?
How many will find a p-value that “approaches significance”?
How many will find a p-value greater than 0.10 (and incorrectly conclude that there is no effect)?
Go ahead, take a guess…
Below are the results reported by 5 of our virtual scientists (all of whom, for some reason, appear to be identical clones of Elaine Benes from Seinfeld).
The solid vertical line is the null hypothesis: that there is 0 difference between groups.
The dashed vertical line is the difference that we know to be true. (Because we built the simulation!)
The red dot is the mean difference between groups that was observed by each of our Elaines, and the horizontal bar is the 95% confidence interval for that difference.
First notice how, with the exact same ground truth, one of the Elaines finds a difference that is “approaching significance”, one finds a result that is “very highly significant”, and the other three Elaines get something in between.
Now take a closer look at that “very highly significant” result of p < 0.001 that everyone is celebrating.
They’re happy because that’s the sort of dramatic finding that might be accepted by a top-tier journal such as Nature. But behold, it’s also a HUGE exaggeration of the true effect size! When someone tries to replicate this Nature study, they’re unlikely to see this big of an effect again.
(Are you starting to see how this ties into the replication crisis?)
And now look at the studies that actually get the true effect size about right - they have p-values of 0.005 and 0.015. Do those effect estimates look all that different to you? Do you really think one is categorically more significant than the other?
Let’s look at another 5 of our Elaines:
Woah! P = 0.984?! How can this be? This last Elaine had the same sample size as all the other Elaines! And we know that there really is a sizeable effect! The p-value should at least be “approaching significance”, right? Anyone running a real experiment that obtained a p-value this large would almost certainly conclude that the drug was ineffective and move on to a different research question. But this is why we say “fail to reject the null”, folks, because the null hypothesis can be wrong—even with a p-value of 0.98!
So we know at least one of our Elaines might be getting a Nature paper while at least one might be contemplating a career change—even though they both ran the exact same experiment! (Hardly seems fair, does it?)
Here’s the breakdown for all 100 Elaines:
| p-value range | Percent of Elaines |
|---|---|
| p < 0.001 | 18% |
| 0.001 < p < 0.01 | 36% |
| 0.01 < p < 0.05 | 26% |
| 0.05 < p < 0.10 | 7% |
| p > 0.10 | 13% |
If you add these up, 80 Elaines got p < 0.05, and 20 Elaines got p > 0.05. That, my friends, is exactly what was supposed to happen. You see, we chose a sample size of 126 (63 per group) because this is the sample size that results in 80% statistical power for a t-test when the mean group difference is 0.5 standard deviations.
POWER AND THE WINNER’S CURSE
If you’ve taken statistics, you’re no doubt familiar with the concept of power, but let’s review.
Power is the probability of obtaining a p-value less than 0.05 (or whatever significance threshold you specify) given that the hypothesized effect is true. This is kind of meta, huh?
We’re talking about a probability—calculated under the assumption that the null hypothesis is false—of obtaining another probability calculated under the assumption that the null hypothesis is true! Before your head explodes, let’s break this down:
You have a p-value, which tells you how frequently you should expect to see data as extreme as yours when there really is no effect.
You have a threshold, usually p < 0.05, at which you plan to reject the null hypothesis that there is no effect.
You have power, which tells you how frequently you should expect to meet that threshold when there really is an effect of a certain magnitude.
Power depends on three things: the sample size, the true effect size, and the p-value threshold you choose. As any one of those three things goes up, so does power. So the power calculation tells us that, with a sample size of 126 and a true effect of 0.5, we would have an 80% chance of obtaining a p-value less than 0.05 and a 20% chance of obtaining a p-value of 0.05 or greater. True to form, 80 of our virtual scientists got a p-value less than 0.05, and 20 did not.
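Here is a minimal sketch of that power calculation (using a normal approximation rather than the exact noncentral t distribution, so the answer is approximate):

```python
from statistics import NormalDist

def power_two_sample(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided two-sample test of means,
    with the effect size in standard-deviation units."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis
    ncp = effect_size * (n_per_group / 2) ** 0.5
    return 1 - NormalDist().cdf(z_crit - ncp)

print(round(power_two_sample(63, 0.5), 2))  # 0.8
```

Nudge any of the three inputs upward (more patients, a bigger true effect, or a looser alpha) and the power rises, just as described.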
Why did we pick 80% power? Because, just as p < 0.05 is the conventional criterion for “statistical significance”, 80% power is the conventional criterion for “adequate power”. Does that number come from some kind of sophisticated cost-benefit analysis? Not at all. We can trace its origin to an eminent statistician, Jacob Cohen, but his reasoning on this matter is entirely subjective.
In Cohen’s mind, a false positive felt about four times worse than a false negative. So if we view 5% as an acceptable risk for a false positive (i.e., the p < 0.05 criterion), then the acceptable risk for a false negative should be 4 times that, or 20%. The complement of this is, voila, 80% power.
Apparently this bit of hand-waving—an arbitrary multiplier of 4 applied to an already arbitrary criterion—was met with a chorus of, “Yeah, that sounds about right!”, because it was quickly adopted as the “gold standard” for determining sample-size requirements. We, however, think this is yet another example of how human beings, including most scientists, are deeply flawed when it comes to their intuitions about probability. In particular, due to some quirk of the human brain, most people will simplify a chance percentage into one of three categories:
So, weirdly, if we told you there’s an 80% chance that your experiment will find a significant result, that experiment somehow sounds like a really good investment of your time and resources. But if we told you that 1 out of 5 of your experiments will fail to find a significant result, for no other reason than you didn’t recruit enough study volunteers… well, you’d probably want to recruit more volunteers! (We don’t need to point out that an 80% chance of success is the same as a 1-in-5 chance of failure, do we?)
Most researchers have at least some vague notion that a non-significant result might be a function of “low power” rather than a flaw in their hypothesis or study design. However, in our experience, few researchers are aware of the dramatic effect that power has on the distribution of p-values that one is likely to obtain. Hence, the whole “approaching significance” fallacy - the false notion that if a study is underpowered, one may not get a p-value less than 0.05, but one will nonetheless get a p-value approaching 0.05 (probably less than 0.10 and almost certainly less than 0.20)—if there really is an effect, of course.
So remember that simulation we created? It was just a colorful reproduction of one of the simulations reported by Halsey et al. in their 2015 Nature Methods paper, The fickle P value generates irreproducible results.
The following figure is from that paper. Each one of these histograms shows the frequency of how many of their “virtual scientists” obtained a p-value within a given range—except they ran 1000 replications instead of 100 and simulated 4 different sample sizes: from left to right, 10 per group (N = 20, power = 18%), 30 per group (N = 60, power = 48%), 64 per group (N = 128, power = 80%), and 100 per group (N = 200, power = 94%).
As you can see, the assumption that p-values will “approach” 0.05 when power is low—and let’s not mince words here—is complete and utter bullshit.
When power is close to 50%, getting a p-value greater than 0.20 is just as likely as getting a p-value between 0.05 and 0.20.
And when power is less than 20%, getting a p-value greater than 0.20 is more than twice as likely as getting a p-value between 0.05 and 0.20. Those p-values aren’t just failing to approach significance—they’re running away from it!
Only when power is very high (94%) would those folk-interpretations have some merit: when there definitely is an effect and the sample size is very large, the vast majority of experiments will indeed find a p-value less than 0.001, and p-values greater than 0.10 will indeed be exceedingly rare.
However, for smaller samples or weaker effects, believing that a p-value tells you something definitive about the likelihood of your hypothesis is incredibly foolish. And—hate to burst your bubble—but with the exception of very large, multisite clinical trials, published studies are rarely this well-powered. Much of preclinical research is a lot closer to 20% power than it is to 80% power, much less 94% power.
So why would researchers bother running an experiment with 20% power when, even if the true effect is as large as they think, there’s only a 1-in-5 chance they will get a p-value less than 0.05?
Well, a study with 20% power is a lot easier to conduct (smaller samples = faster and cheaper), and researchers are often short on time or money (usually both). “So,” they reason, “we’ll run a small pilot study. Sure, we probably won’t find anything significant, but if we see a trend then we can use it as ‘preliminary data’ in a grant proposal to fund a larger study. Who knows, maybe the true effect size (and our actual power) is larger than we think, and, even if it’s not, a 1-in-5 chance ain’t so bad—we might get lucky. Anyway, it doesn’t hurt to try….”
That may sound reasonable, but here’s the problem: power doesn’t just impact the odds of getting p < 0.05; it also impacts the precision of the test statistic. The likelihood that an observed effect approximates the true effect has nothing to do with p-values; it has everything to do with power.
We got a hint of this before, when we saw in our simulation that the result with the smallest p-value, less than 0.001, was a far worse estimate of the true effect than a result with p = 0.08.
Let’s return to those simulations. But this time, instead of just showing you five replications at a time, we’re going to show you all 100, sorted from the replication with the smallest observed effect (the one that made Elaine cry) to the largest. Remember, that was a simulation with 80% power. This time we’re also going to show you what the same simulation looks like with 20% power.
The thing that probably pops out at you the most is just how much wider the confidence intervals are for 20% power vs. 80% power, but look closer because that’s not the only difference.
Notice how with 80% power, the vast majority of observed mean differences (the red dots at the center of each confidence interval) hug pretty close to the true mean difference (the vertical dashed line). Now look at what happens with 20% power - hardly any of the observed differences get close to the true difference. It is very unlikely that an observed effect size from a low-powered pilot study will be anywhere close to the true effect size.
But wait, there’s more! The rectangles we’ve drawn at the bottom of each plot highlight the proportion of studies that substantially overestimate the true effect size. Notice how the rectangle for 20% power is a lot bigger than for 80% power? That means underpowered studies are much more likely to exaggerate effect sizes, and the potential magnitude of this exaggeration is much greater. With 80% power, even the most optimistic replication finds an effect size that is only about double the true effect size; with 20% power, a doubling or even tripling of the true effect size is commonplace.
But wait, there’s still more! Take a closer look at the handful of replications under 20% power that do get the true effect right (i.e., are centered around the dashed vertical line). Notice how the confidence intervals for every single one of them include the solid vertical line at 0, the null hypothesis. That means they all have p-values greater than 0.05, and that means that the true mean difference will never be judged statistically significant by an underpowered study.
And now look at the replications whose confidence intervals don’t include 0, meaning their p-values will be less than 0.05. Every single one of them observes an effect size 2-3 times larger than the true effect size. That means that the only way for an underpowered study to find a statistically significant difference is to gravely overestimate the true difference.
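You can watch the winner’s curse happen numerically. The sketch below (our own, again with a normal approximation to the t-test) reruns the Despondex trial with only 10 patients per group, which puts power in the neighborhood of 20%, and then averages the observed effect among just the “significant” replications:

```python
import random
from statistics import NormalDist, mean, variance

def one_trial(n=10, true_diff=0.5):
    """Return (observed difference, approximate p-value) for one small trial."""
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(true_diff, 1.0) for _ in range(n)]
    diff = mean(b) - mean(a)
    se = ((variance(a) + variance(b)) / n) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, p

random.seed(7)
results = [one_trial() for _ in range(5000)]
significant = [d for d, p in results if p < 0.05]
power_est = len(significant) / len(results)
winner_mean = mean(significant)
print(power_est)    # roughly 0.2: only about 1 trial in 5 "works"
print(winner_mean)  # roughly double the true effect of 0.5
```

The only small trials that clear the p < 0.05 bar are the ones that happened to observe a wildly inflated difference, so the published “winners” are systematically exaggerated.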
THE TALE OF THE UNDERPOWERED STUDY THAT FOUND A SIGNIFICANT RESULT
So, in light of this knowledge, let’s revisit the rationale for doing a small pilot study. What is typically going to happen with such a study?
Well, assuming it has about 20% power, there’s about a 2-in-5 chance that the observed effect is so small (or even opposite to the hypothesized direction) that the researchers conclude that there’s no point in using it in a grant proposal. They file the results away and move on to a different idea for a pilot study.
Then there’s about a 2-in-5 chance that they get the evidence of the “trend” they’re looking for, i.e., an effect that is not statistically significant but that is within 0.25 standard deviations of what they hoped to see. They write a grant proposal to collect more data so that they will have decent statistical power, which is the best possible outcome because collecting more data is exactly what they need to do.
But then there’s a 1-in-5 chance that they will get a significant result right off the bat, even with their small, crappy sample. And here’s where things start to go off the rails because most researchers seem to believe that low power acts only in one direction: to make effects appear weaker than they actually are. Of course, we just showed you that the effect-size distortion cuts both ways: for every underpowered study that underestimates the true effect, there is an underpowered study that overestimates it.
But, not appreciating this, our researchers will reason that if they made it past the p < 0.05 hurdle, they’re in the clear. They’ll be very excited to have discovered a much larger effect than they’d hoped for, and they’ll convince themselves and others that they should have expected this large of an effect all along. Now, if the researchers go ahead and publish their study, what started off as a “win” turns into a “curse”. Here’s why.
Because they have a small p-value, they—not the other 4 labs that tried this experiment—will be the ones to get published first. If it’s a really novel, sexy finding, it may even get published in a high-impact journal. That may seem like an unqualified win: they made it to the finish line first, so they get the fame and glory. Problem is, the reason they got to the finish line first is because they ran a fast and cheap study and got lucky, which means they’ve reported an effect that is grossly exaggerated. And that’s the winner’s curse.
Other researchers may be inspired to follow up with similar studies, using this original study’s results to determine sample-size requirements. They will plug the published effect size into a power analysis—not adjusting for the fact that the true effect is likely 2-3 times smaller.
Or, more likely (because power analysis is a tiresome chore), they’ll just mimic the same sample size that the first paper used. They’ll foolishly reason, “If a study published in a top-tier journal used a given sample size, then a replication with the same sample size will also find significant results.”
Either way, now we’ll have several labs running severely underpowered experiments and, because the power is still stuck at 20%, only one out of five will replicate the result. (Recall how fewer than 25% of drug-discovery experiments could be reproduced? That’s very much in line with what we’d expect from this scenario.)
So now what happens?
Well, the 20% of labs that replicate the result right away will jump on the publication bandwagon, so you’ll get a few initial publications supporting the original reported effect size. This lends the finding even more credibility, so the 4 out of 5 labs that failed to see an effect will worry that maybe they did something wrong. They’ll scrutinize their work for mistakes and likely find some (no one’s perfect!) and make a second or third attempt before they try to publish a negative result. And, if the chance of getting a significant result is 1/5 for a single study, a researcher who conducts three such studies has an almost 1/2 chance (1 − 0.8³ ≈ 0.49) that at least one will find statistical significance. (Third time really is the charm?)
Now we start to accumulate selection bias in the published literature. Instead of reporting all the replication attempts—failures and successes alike—researchers tend to misattribute failures to experimenter error and successes to experimenter rigor. Reporting the failed replication attempts—so the rationalization goes—would just complicate the narrative. So they only report the successful replication, and the underpowered studies continue.
Of course, there’s an equal chance that a lab attempts three replications, and not one of them pans out. These researchers may get on their high horse and assert that the original finding was bogus and that there is no such effect. But their paper is likely to be reviewed by peers who have observed the effect, whose research careers have become quite invested in the effect, and who are going to feel rather put out by this attitude. They will look for methodological deviations from their prior work and use them to argue that the paper should be rejected.
And so the exaggerated finding may “coast” on publication bias for a while, but eventually the failed replication attempts will get published. A new consensus will emerge that “the literature is mixed”. That is, some studies report strong effects, and some studies report no effects. But the human desire to make sense out of noise is strong, so researchers on “the effect is real” bandwagon will start to look for differences between the studies that find effects and those that don’t. And because, in the real world, no two studies are ever exactly the same, they’ll likely find some.
Maybe studies that observe the effect tend, by chance, to have more females than males, so maybe the effect is conditional on gender. More underpowered studies are then run to test that hypothesis, and the problem mushrooms from there. Some studies will report that females show a stronger effect than males; others will report that males show a stronger effect than females. Then the researcher needs to look for yet another conditional variable that is capable of reversing the conditional effect of gender….
Maybe eventually someone will re-run the original study with a much, much larger sample, finally bring some sanity to this enterprise, and conclude there really is an effect. But it’s only half as large as what the original study found, and the sample size needs to be about four times larger to replicate it consistently.
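That “half the effect, four times the sample” relationship isn’t a coincidence: under the standard normal-approximation formula for a two-group comparison, the required sample size per group scales with 1/d², where d is the standardized effect size. A back-of-the-envelope sketch (our own illustration, not from the studies above):

```python
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison
    (normal approximation): n ~ ((z_{1-alpha/2} + z_{power}) / d) ** 2."""
    z = NormalDist()
    return ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2

# Because n scales with 1/d**2, halving the effect size quadruples the
# sample needed for the same power:
print(round(n_per_group(0.8)))  # ~12 per group
print(round(n_per_group(0.4)))  # ~49 per group, about 4x as many
```

So if the true effect is half the size the pilot study suggested, you need roughly four times the participants to detect it reliably.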
(Think this is just a story we made up to scare you? Read this: Why Most Discovered True Associations Are Inflated)
REAL-WORLD EXAMPLES, ANYONE?
Are you about done with simulations? Well, do we have a treat for you! Check out the real-world example below (Sattar et al., Lancet, 2010) investigating the increased risk of diabetes from statins:
Note the resemblance between this real-world meta-analysis and the Despondex simulations we just talked about. The true effect size appears to be an odds ratio of 1.09, and the estimated effect size from any one trial dances around this value. Some trials show a “highly significant” effect (JUPITER & PROSPER). Some “approach significance” (HPS & ALLHAT). Several show no effect, or an effect that trends in the opposite direction. The p-values are highly variable, but note that all of the confidence intervals include the true effect.
Still not convinced? Here’s another real-world clinical example, published by McCarren, Hampp, Gerhard, & Mehta (Am J Health-Syst Pharm. 2017; 74:1262-6). This table shows how subsamples drawn from the same data disagree, producing a wide range of p-values:
I mean, just look at the first potential predictor of asthma. It’s the exact same data that you are pulling subsamples from, but you get all colors of the rainbow in terms of p-values.
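You can reproduce this rainbow yourself. Here’s a toy simulation (our own, not the McCarren data): draw repeated subsamples from a single dataset with one modest, fixed true effect, and watch the p-values scatter all over the place.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(7)

# One simulated "full dataset" with a modest true effect (Cohen's d ~ 0.3)
treated = [random.gauss(0.3, 1) for _ in range(2000)]
control = [random.gauss(0.0, 1) for _ in range(2000)]

def approx_p(a, b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))

# Five random subsamples (100 per group) drawn from the SAME data
pvals = [approx_p(random.sample(treated, 100), random.sample(control, 100))
         for _ in range(5)]
print([round(p, 3) for p in pvals])  # same data, same true effect, scattered p-values
```

The underlying effect never changes between subsamples; only the sampling noise does. Yet the p-values can swing from “highly significant” to nowhere close.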
BUT DON’T TAKE OUR WORD FOR IT.
Reason and evidence not enough? Need an appeal to authority? We’ve got you covered!
Statisticians have been sounding the alarm about p-values for years to no avail, so in 2016 the American Statistical Association (ASA) finally took the unusual step of issuing a public proclamation. We encourage you to read their full statement, but here are the bullet points. (We hope we’ve already convinced you of these.)
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
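The point that a p-value doesn’t measure the size or importance of an effect is easy to demonstrate. In this sketch (our own illustration, not from the ASA statement), the true effect is identical and utterly negligible in both comparisons; only the sample size differs:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)

def approx_p(a, b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))

# A negligible true effect: Cohen's d = 0.02 in both cases
small = ([random.gauss(0.02, 1) for _ in range(50)],
         [random.gauss(0.00, 1) for _ in range(50)])
huge  = ([random.gauss(0.02, 1) for _ in range(200_000)],
         [random.gauss(0.00, 1) for _ in range(200_000)])

print(approx_p(*small))  # likely nowhere near "significant"
print(approx_p(*huge))   # likely p < 0.05, despite the same trivial effect
```

Crank n high enough and any nonzero effect, however clinically meaningless, will eventually cross the p < 0.05 line.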
ONLY THE TIP OF THE ICEBERG
Believe it or not, all of the problems we’ve talked about so far refer to the best-case scenario: that is, we’ve only discussed what happens when there really is an effect but, because of inadequate sample size, p-values are not reliable indicators of its significance. Combined with publication selection bias, this has led to an epidemic of initial findings that report exaggerated effects that cannot be consistently replicated.
It is a repeating pattern that we are all too familiar with: an initial pilot study finds an amazing result that gets everyone excited, only to leave everyone disappointed when a larger clinical study finds a far weaker result. Now you understand why this happens and how a little more numeracy (that’s like literacy, but for probability and statistics) could help us get off this roller coaster.
But we haven’t begun to talk about what happens when there really is no effect but practices like data dredging, p-hacking, and HARKing (hypothesizing after results are known) lead researchers to misidentify pure noise as statistically significant. In Part 2, we’ll explain what goes horribly wrong when researchers use the same data set to both discover and test their hypotheses. (Spoiler: those two things need to be kept separate!) And we’ll advise you on how best to evaluate research in light of all this, and how you can be part of the movement to improve reproducibility in science.