Why most scientific studies are wrong — and what science is doing about it

2026-05-20

In 2016, twenty-three laboratories across eleven countries ran the largest test ego depletion had ever received. The study enrolled 2,141 participants. The protocol was identical at every site and registered publicly before a single measurement was taken, a methodological commitment that barred, by design, the analytic adjustments that earlier ego depletion researchers had never been asked to rule out. Effect size: d = 0.04.

The prior meta-analysis, by Hagger and colleagues in 2010 in Psychological Bulletin, had synthesized 198 published tests and found d = 0.62. Ego depletion — the theory that self-control draws on a single finite resource that depletes with use — had been staked out by Roy Baumeister and colleagues in a 1998 Journal of Personality and Social Psychology paper and apparently confirmed, across hundreds of studies, for nearly two decades. It entered undergraduate textbooks. It reached management consultancies, productivity coaches, and HR departments that redesigned meeting schedules to protect employees from burning through their willpower before lunch.

From d = 0.62 to d = 0.04.

This was not one underpowered study failing to replicate a well-powered literature. Twenty-three labs ran simultaneously. Total sample: nearly four times anything ego depletion had previously faced. They found essentially nothing.

What makes d = 0.04 the more remarkable figure is not its size. It is who produced it. The 2016 study was organized by scientists, run with standard methods, published in Perspectives on Psychological Science, and coordinated through a pre-registration system built specifically to prevent the methodological flexibility that had characterized the original literature. The correction came through the same institutional machinery as the findings it demolished.

Science did this. Not journalism. Not a whistleblower. Science, using science’s own tools, turned on one of science’s own most-cited theories and found nothing there.

The pattern

Ego depletion is not an anomaly. It is what the literature produces at scale.

In 2010, Dana Carney, Amy Cuddy, and Andy Yap published a study in Psychological Science. Forty-two participants. Two minutes in an expansive physical posture — arms wide, hands on hips — elevated testosterone, lowered cortisol, and increased tolerance for financial risk. Those were the claims. Cuddy presented the findings in a 2012 TED talk that has since reached more than 76 million viewers. The finding moved into executive coaching, corporate training programmes, and job-interview preparation advice telling candidates to pose like a superhero in the bathroom before walking into the room.

By 2015, replications were failing to find the hormonal effects. In 2016, Dana Carney — the paper’s first author — published a statement explicitly abandoning the theory. She named the problems directly: a sample of 42, multiple dependent variables tested simultaneously, publication of whichever survived. Cuddy did not recant. She narrowed the claim — from measurable hormonal changes to subjective feelings of confidence — and continued presenting the research. Carney named the methodology and walked away. Cuddy adjusted the conclusion and kept going.

John Bargh, Mark Chen, and Lara Burrows published their elderly priming study in the Journal of Personality and Social Psychology in 1996. Participants exposed to words associated with old age — “Florida,” “bingo,” “wrinkled” — walked measurably more slowly down the hallway afterward. The finding became a centrepiece of social priming research and appeared in Daniel Kahneman’s Thinking, Fast and Slow (2011) as evidence for the automatic, thoroughgoing influence of context on behavior.

In 2012, Stéphane Doyen and colleagues published a replication attempt in PLoS ONE. They used infrared timing equipment rather than a researcher with a stopwatch, blinded the experimenters to the hypothesis, and used a larger sample. The walking-speed effect appeared only when the experimenters expected to find it. The slowing wasn’t from the priming. It was from the experimenters. Bargh responded by calling the Belgian researchers “incompetent or ill-informed” in a blog post he subsequently deleted.

Kahneman, who had staked a portion of his book’s credibility on social priming findings, sent an open letter that September addressed to researchers in the field. He saw, he wrote, “a train wreck looming.” He later acknowledged in public that his treatment of social priming in Thinking, Fast and Slow had been too credulous. That acknowledgment was not forced from him. He offered it.

The systematic picture came from the Open Science Collaboration, which spent four years replicating 100 published psychology experiments. Published in Science in 2015: 97% of originals had statistically significant findings. 36% of the replications matched that criterion. Average replication effect sizes were roughly half the originals. This was not a theoretical critique. It was a direct empirical measurement — here are 100 studies, here is how many hold up when someone actually runs them again.

Then there is Daryl Bem. In 2011, Bem published a paper in the Journal of Personality and Social Psychology reporting nine experiments providing evidence of extrasensory perception. Participants appeared to know what images a computer would display before the computer had selected them. The paper used entirely standard statistical methods, identical to those used by any other paper in the same journal that year. Replications failed. Yet when methodologists examined the original, they found Bem’s methods impeccable by the field’s prevailing standards. Nine experiments, all but one significant. For precognition.

If standard methods can produce eight apparently significant studies out of nine for an effect that is physically impossible, the methods are the problem. Bem did not cause the replication crisis. He made the structural problem impossible to ignore.

The exception that proves nothing

In 2011, Diederik Stapel, a Dutch social psychologist at Tilburg University, was dismissed after an investigation found he had fabricated data wholesale, across years of research and in dozens of published studies. He invented datasets entirely.

Stapel is not an explanation for the replication crisis. He is the wrong explanation, and keeping that distinction clear matters. The researchers whose findings collapsed (the ego depletion labs, the power pose researchers, the social priming literature) were not fabricating data. They were running studies and writing papers using methods the field accepted and rewarded. The replication crisis did not require fraud to operate. It required the system to run as designed, with researchers responding to the incentives in front of them. Stapel is the outlier. The systemic argument explains everything else, and it does not need to reach for anything more exotic than normal human behavior in a badly designed institution.

How the system produces this

Every finding described above passed peer review. Every study used methods that were standard for the field. Stapel aside, no data was fabricated. The process worked exactly as designed.

That is the problem.

Robert Rosenthal named the file drawer problem in a 1979 Psychological Bulletin paper — though Theodore Sterling had documented the phenomenon empirically two decades earlier, finding in 1959 that 97.3% of papers in four major psychology journals reported statistically significant results for their primary hypotheses. The logic is simple and its consequences are not. Journals publish significant findings at a much higher rate than null results. Failed experiments — studies that found no effect, the wrong direction, or no publishable story — accumulate unpublished. The literature is not a representative sample of all experiments run. It is the selected subset that survived a filter strongly biased toward the positive.

What this produces at scale: a researcher runs ten experiments testing a hypothesis that is actually false. Nine return null results. One returns a false positive by chance, which is no surprise: with a 5% significance threshold applied to a true null, ten attempts have roughly a 40% chance of producing at least one. The nine go nowhere. The one gets published. The field acquires a positive study and no corrective information. Multiply this across a field, across decades, and the literature fills with chance findings dressed as accumulated evidence.
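The ten-experiment scenario can be checked numerically. The sketch below is an illustration, not anything taken from Rosenthal or Sterling: it assumes independent studies, a true null, and a continuous test statistic (so that p-values are uniformly distributed under the null). The function names are hypothetical.

```python
import random

ALPHA = 0.05  # significance threshold used in the text


def chance_of_false_positive(n_studies: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of n independent tests of a true null is significant."""
    return 1.0 - (1.0 - alpha) ** n_studies


def simulate_file_drawer(n_researchers: int, n_studies: int, seed: int = 0) -> float:
    """Fraction of simulated researchers left with at least one publishable false positive.

    Assumption: under a true null the p-value is uniform on [0, 1],
    so each study's p-value is drawn directly with rng.random().
    """
    rng = random.Random(seed)
    with_hit = 0
    for _ in range(n_researchers):
        p_values = [rng.random() for _ in range(n_studies)]
        # Only the significant studies leave the file drawer.
        if any(p < ALPHA for p in p_values):
            with_hit += 1
    return with_hit / n_researchers


if __name__ == "__main__":
    print(f"analytic:  {chance_of_false_positive(10):.3f}")
    print(f"simulated: {simulate_file_drawer(20_000, 10):.3f}")
```

Both figures should come out near 0.40: about four in ten researchers testing a false hypothesis ten times walk away with something publishable, and the literature records only that.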

P-hacking operates inside a single study rather than across many, which makes it harder to detect and more pervasive. A researcher finishes data collection and gets p = 0.06. Borderline. They remove two participants who showed unusual response patterns before the formal analysis. Now p = 0.047. Or they add a covariate that was available but not originally planned. Or they run the analysis at multiple sample sizes and stop where the data look cleanest, arriving at p = 0.048. Nothing in the published paper will record any of this — only the final analysis, presented as if it were the only one run. Simmons, Nelson, and Simonsohn demonstrated in a 2011 Psychological Science paper that just a handful of these “researcher degrees of freedom” — decisions about participants, covariates, sample size, and stopping rules — can inflate the false positive rate from the nominal 5% to over 60%. The researcher is not lying. They are responding rationally to a reward structure that treats p < 0.05 as the entire difference between success and failure.
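One of these degrees of freedom, stopping data collection when the result looks good, is easy to simulate. The following is a minimal sketch under stated assumptions, not a reproduction of the Simmons, Nelson, and Simonsohn analysis: it uses a two-sided z-test with a known standard deviation of 1, data drawn from a true null, and an arbitrary schedule of peeks every ten participants. All names and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05
_STD_NORMAL = NormalDist()  # standard normal, used for the z-test


def two_sided_p(sample: list) -> float:
    """Two-sided z-test p-value for 'mean equals 0', assuming a known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def false_positive_rate(peek_at: list, n_sims: int = 5_000, seed: int = 1) -> float:
    """Share of null experiments declared significant at any of the scheduled peeks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = []
        for n in peek_at:
            # Collect participants until the next planned look at the data.
            while len(data) < n:
                data.append(rng.gauss(0.0, 1.0))  # true effect is exactly zero
            if two_sided_p(data) < ALPHA:
                hits += 1
                break  # stop as soon as the result "works"
    return hits / n_sims


if __name__ == "__main__":
    print("one planned analysis at n=50:", false_positive_rate([50]))
    print("peeking every 10 up to n=50: ", false_positive_rate([10, 20, 30, 40, 50]))
```

The single planned analysis should land near the nominal 5%, while the peeking schedule should come out noticeably higher, even though every individual test is computed correctly. Stacking further flexible choices on top (outlier removal, optional covariates) is what pushes the rate toward the figure Simmons and colleagues report.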

HARKing (Hypothesizing After the Results are Known) completes the picture. Norbert Kerr coined the term in a 1998 Personality and Social Psychology Review paper. The mechanism: a researcher runs an experiment expecting to find effect X, finds effect Y instead, constructs a hypothesis that would have predicted Y, and writes the paper as if that hypothesis had been there from the beginning. The published article reads as theory confirmation. It is discovery wearing confirmatory clothes. The incentive is structural: confirmatory findings are more publishable than exploratory ones. A result reported as predicted is received differently from the same result reported as stumbled upon. Every exploratory finding has a reason to present itself as confirmatory, and no one reading the final paper can tell the difference.

These three mechanisms running simultaneously produce an entirely predictable mathematical result. John Ioannidis worked out the arithmetic in a 2005 PLOS Medicine paper. Consider a field where the ratio of true to false hypotheses being tested is low, which describes most of social psychology, where speculative theories are tested in cheap lab experiments with small convenience samples. If samples are also underpowered, analytical flexibility is high, and publication incentives strongly favor positive results, the published literature will contain more false positives than true ones. Without fraud. Without incompetence. Just the system operating as built.

The specific arithmetic: if 10% of tested hypotheses are actually true, with 80% statistical power and a significance threshold of 0.05, running 100 tests yields an expected 8 true positives (80% of the 10 true hypotheses) and 4.5 false positives (5% of the 90 false ones). The false discovery rate, the fraction of published positives that are wrong, is 4.5 out of 12.5, or 36%. Lower the base rate of true hypotheses, as you would in a field that tests highly speculative claims in small, flexible studies, and the false discovery rate climbs past 50%. The literature can be more wrong than right. The system only needs to be built the way it was.
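The same arithmetic fits in a few lines of code, which makes it easy to see how the result moves as the inputs change. This is a simplified sketch that ignores bias and multiple competing teams, both of which Ioannidis's fuller model includes. The function name is hypothetical, and the 5% base rate and 35% power used in the later calls are illustrative assumptions, not figures from the article. Computed without rounding the false-positive count, the article's inputs give 4.5 expected false positives per 100 tests and a false discovery rate of 36%.

```python
def false_discovery_rate(base_rate: float, power: float, alpha: float = 0.05) -> float:
    """Expected fraction of significant results that are false positives.

    base_rate: share of tested hypotheses that are actually true
    power:     probability that a true effect yields a significant result
    alpha:     significance threshold (false positive rate under a true null)
    """
    true_positives = base_rate * power
    false_positives = (1.0 - base_rate) * alpha
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # The article's scenario: 10% true hypotheses, 80% power.
    print(f"{false_discovery_rate(0.10, 0.80):.3f}")
    # Illustrative assumption: a more speculative field, 5% true hypotheses.
    print(f"{false_discovery_rate(0.05, 0.80):.3f}")
    # Illustrative assumption: same 10% base rate, but underpowered studies.
    print(f"{false_discovery_rate(0.10, 0.35):.3f}")
```

With 80% power the rate crosses 50% once the base rate falls below roughly 6%. With weaker power it crosses much sooner, which is why underpowered studies and speculative hypotheses compound each other.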

What p < 0.05 actually means — and doesn't

A p-value below 0.05 means: if the null hypothesis were true, results at least this extreme would appear by chance less than 5% of the time. It does not mean there is a 95% probability the finding is real. It does not confirm the hypothesis. Ronald Fisher, who popularized the p-value, intended it as a weak signal worth following up with further experimentation, not a binary publication criterion.

The null hypothesis significance testing framework that became psychology’s dominant standard is a hybrid of two incompatible statistical philosophies: Fisher’s significance testing and the Neyman-Pearson hypothesis-testing framework. Neither originator endorsed the combined approach; each side criticized the other’s system in print. The p < 0.05 threshold, borrowed partly from agricultural statistics and partly from industrial quality control, became a universal publication gate through institutional inertia rather than statistical logic. Its misuse collapses a continuous measure of evidence into a binary pass/fail decision, and in doing so creates exactly the incentive to cross the threshold by whatever means are available.

Why rational people did this

The question is not why bad scientists did bad science. Most of the researchers whose work collapsed were neither corrupt nor incompetent. The question is why the incentive architecture made this behavior predictable for any scientist working in the field.

Academic careers depend on publications — specifically on publications in high-status journals, which prefer statistically significant results. A researcher who runs three experiments and gets one significant result publishes the one. The other two go into the file drawer. No malice required. The system evaluates output, not process. It rewards the significant result. It ignores the null.

Every researcher faces the same calculation.

Grant funding compounds the problem. Research funding is substantially easier to secure with a track record of published findings. Null results don’t build that track record. A researcher who consistently finds that their hypotheses don’t hold faces harder grant cycles, a thinner publication list, and diminishing institutional support — all flowing from a CV that looks, by the metrics that matter, like failure. The incentive to find positive results is not a temptation to be resisted. For many researchers, it is a condition of continued employment.

Peer review cannot fix this. Reviewers are experts in the field being reviewed. They are also invested in that field’s prior findings. A paper challenging an established result — questioning social priming, say — faces reviewers who have built theoretical frameworks, written grants, and published papers that depend on those findings holding. This is not conspiracy. It is what happens when expertise is combined with professional investment in a specific set of results. The bar for challenging an established finding is reliably higher than the bar for confirming one, under methodologically equivalent conditions.

None of this excuses lazy research design or motivated reasoning, both of which were present. In Stapel’s case, none of it excuses fabrication. The point is that the system would have produced the same distorted literature without any of those individual variations. The mechanism did not require bad people. It required people in a badly designed system, responding to the signals in front of them.

If the diagnosis is individual moral failure, the fix is policing. If the diagnosis is structural incentive misalignment, the fix is redesign. The diagnosis is structural.

What science is actually doing

The replication crisis went public around 2011. By then the methodological critique had been building in specialist literature for years; what changed was the visibility — the OSC paper, the findings behind the TED talks that hadn’t replicated, the Bem debacle making it impossible for the field to pretend the problem was marginal. What followed was a genuine reform effort, uneven in uptake and incomplete in scope, but distinguishable from institutional theatre by one thing: some of the reforms demonstrably changed outcomes.

Pre-registration is the most widely adopted reform. Researchers register their hypothesis, sample size, and analytic plan in a public repository before data collection begins. This makes HARKing visible: if the analysis that was run differs from the analysis on record, the discrepancy is documented. It formally separates exploratory from confirmatory research — both are legitimate, but in the old system they looked identical in published form, and exploratory findings were routinely treated as confirmatory evidence. The Open Science Framework and AsPredicted are the primary platforms, and pre-registration is now expected in many leading psychology journals.

But pre-registration does not eliminate publication bias.

It changes what researchers do before they write. It does not change what editors decide to accept. Olmo van den Akker and colleagues tested this directly in a 2023 Behavior Research Methods paper, comparing 193 pre-registered psychology studies to 193 matched non-pre-registered studies. Positive-result rates did not drop substantially. Effect sizes were numerically smaller in pre-registered studies — mean r of 0.29 versus 0.36 — but the difference was not statistically significant. Pre-registered studies had larger samples and more frequent power analyses: the transparency infrastructure was working. The publication incentive was unchanged by it.

That requires a different mechanism.

Registered Reports are the structurally significant reform with demonstrated impact. Under this model, a journal evaluates and accepts or rejects a paper based on the research question and method — before data is collected — and commits to publish the result regardless of outcome. Publication bias is eliminated at its source: the editorial decision is made before the result exists. Registered Reports were introduced at the journal Cortex in 2013, pioneered by editor Chris Chambers, and have since been adopted by more than 300 journals.

The evidence for their effect is not theoretical. Anne Scheel, Mitchell Schijen, and Daniël Lakens, in a 2021 Advances in Methods and Practices in Psychological Science paper, compared results in Registered Reports to a random sample of standard psychology publications. Standard publications: 96% statistically significant results. Registered Reports: 44%. That 52-percentage-point gap is publication bias measured, not estimated. It shows exactly what happened to positive-result rates when journals removed the ability to reject null results after the fact. Pre-registration leaves that ability intact. Registered Reports remove it before the data exist. That is a structural difference, not a cosmetic one.

The Many Labs projects provide the closest thing psychology has to ground-truth systematic replication data. Many Labs 2 (Klein and colleagues, 2018) attempted to replicate 28 effects across 125 samples in 36 countries with 15,305 participants. Using standard significance criteria, 54% replicated. More revealingly: 75% of replication effect sizes were smaller than the originals. The effects that replicated did so consistently across samples from different countries, universities, and demographic groups. The ones that didn’t replicate evaporated rather than merely shrinking. Many Labs 2 does what single-lab replications cannot: it distinguishes genuinely robust effects from effects that exist only in one lab’s particular experimental conditions.

Open data requirements close the loop. Several journals now require raw data and analysis code to be shared publicly. This makes the analysis checkable in a way peer review cannot achieve — a reviewer sees the write-up; a researcher with the raw data can see whether the write-up accurately represents it. Brian Nosek’s Center for Open Science has driven adoption of transparency standards across hundreds of journals through an Open Science Badges program that marks pre-registration, data sharing, and materials sharing in published papers.

Bayesian analysis and minimum sample-size requirements deserve a brief note: both improve individual studies without touching the structural problem. Bayesian analysis asks how new data should update prior belief rather than producing a binary significance decision; it is growing in psychology and cognitive science, and it produces better individual analyses. But it has not restructured publication incentives or produced a measurable population-level reduction in false positives, because it does not change what gets published. Minimum sample size and power analysis requirements are similarly valuable at the level of study design, and similarly limited: adoption is uneven, not a general standard. Neither has the structural bite of Registered Reports, which intervene at the editorial decision rather than at study design.

The lag and the open question

The reforms are real. The Scheel data and the Many Labs results are not contested. What they do not address is the scale of what came before them — or where the reforms have not reached.

The Open Science Collaboration found that roughly 64% of a 100-study sample failed to replicate cleanly. Psychology publishes thousands of studies annually. Most of the pre-crisis literature has not been retested, and most of it probably never will be. Rigorous multi-lab replication is slow, expensive, and requires coordinating dozens of teams around a single protocol. The Hagger multi-lab ego depletion study took years to organize. The original Baumeister experiment took one afternoon. That asymmetry is structural: generating a finding is cheap; correcting it with the rigor the original lacked is not. Textbooks that cited ego depletion as established science were not recalled in 2016. The correction exists in the scientific record. Its reach outside that record is limited in ways that no reform of peer review can fix.

The incentive architecture, more broadly, has not fundamentally changed. Pre-registration and Registered Reports are available in hundreds of journals and are expected in leading psychology venues. They are not mandatory anywhere. Tenure committees still evaluate researchers primarily on publication counts and citation metrics. A junior researcher who runs well-powered, pre-registered studies and publishes fewer papers produces a thinner CV than one who runs many small, flexible studies and publishes more positive results. The pressure toward positive findings has been partially counteracted in subfields that have adopted the reforms. The pressure remains.

Medicine and nutrition are where this gap matters most — and, as of now, where the reforms documented here have arrived least. To be precise: these are not fields this article has audited. They are invoked because they are the horizon of the unfinished work, the territory where the same structural failures produce consequences that academic psychology’s failures don’t. Ioannidis’s 2005 JAMA paper found that a third of highly cited clinical research findings were later contradicted or shown to have substantially smaller effects — medicine’s version of the same pattern. In nutrition, the situation is arguably worse: the studies are typically small and observational, industry funding is endemic, there is no practical way to randomize dietary exposure over years, and decades of dietary recommendations have oscillated as underpowered findings cycled in and out of favor.

The shifting advice on dietary fat is not evidence of corrupt nutrition science. It is what small, confounded, non-pre-registered studies produce when they drive public health policy, which is the same thing academic psychology produced when it drove management practice and self-help culture.

The reforms work. In the fields that adopted them, the Scheel data and the Many Labs results confirm it. The fields that haven’t are the fields where the stakes are highest.

What remains

HR departments restructured meeting schedules around decision fatigue because ego depletion had been established. Self-help books told readers to protect their willpower reserves by making fewer choices earlier in the day. Productivity culture built a minor industry around the concept of cognitive fuel. These weren’t metaphors. They were practical instructions derived from scientific findings that, under actual scrutiny, evaporated. The Hagger multi-lab result was published in 2016. The textbook editions that cited ego depletion as established science were not recalled. The downstream applications of the original literature continue circulating in forms that the scientific correction cannot reach — which is the asymmetry the reforms were not designed to address.

The twenty-three labs that demolished ego depletion used pre-registration, coordinated multi-site protocols, and a commitment to publish regardless of outcome. Those are the structural tools that Registered Reports institutionalize. What a comparable audit would find in medicine’s clinical literature or nutritional guidance — that question hasn’t been answered because the audit hasn’t been run. The structural conditions are the same. The incentive architecture is largely the same. Psychology was dragged into a public reckoning over a decade, after visible failures accumulated to a critical mass. Whether other fields adopt these reforms before a comparable reckoning forces them is a practical question, not a rhetorical one.

What the cost of the wait is, in patient decisions made on uncertain evidence and dietary guidelines that will oscillate again before they settle — nobody has put that figure together. The tools to reduce it exist. They just haven’t been used where the cost is highest.

Key Sources and References

Baumeister, R. F., Bratslavsky, E., Muraven, M., & Tice, D. M. (1998). Ego depletion: Is the active self a limited resource? Journal of Personality and Social Psychology, 74(5), 1252–1265. https://doi.org/10.1037/0022-3514.74.5.1252

Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136(4), 495–525. https://doi.org/10.1037/a0019486

Hagger, M. S., Chatzisarantis, N. L. D., et al. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546–573. https://doi.org/10.1177/1745691616652873

Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368. https://doi.org/10.1177/0956797610383437

Cuddy, A. (2012, June). Your body language may shape who you are [Video]. TED Conferences. https://www.ted.com/talks/amy_cuddy_your_body_language_may_shape_who_you_are

Carney, D. R. (2016, October). My position on “Power Poses” [Faculty statement]. Haas School of Business, University of California, Berkeley. https://faculty.haas.berkeley.edu/dana_carney/pdf_my%20position%20on%20power%20poses.pdf

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71(2), 230–244. https://doi.org/10.1037/0022-3514.71.2.230

Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind? PLoS ONE, 7(1), e29081. https://doi.org/10.1371/journal.pone.0029081

Yong, E. (2012). A failed replication draws a scathing personal attack from a psychology professor. National Geographic. https://www.nationalgeographic.com/science/article/failed-replication-bargh-psychology-study-doyen

Kahneman, D. (2012, September 26). A proposal to deal with questions about priming effects [Open letter circulated to social priming researchers]. Available via Nature News: https://www.nature.com/news/polopoly_fs/7.6716.1349271308!/suppinfoFile/Kahneman%20Letter.pdf

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance — or vice versa. Journal of the American Statistical Association, 54(285), 30–34. https://doi.org/10.2307/2282137

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217. https://doi.org/10.1207/s15327957pspr0203_4

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. A. (2005). Contradicted and initially stronger effects in highly cited clinical research. JAMA, 294(2), 218–228. https://doi.org/10.1001/jama.294.2.218

van den Akker, O., et al. (2023). Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods, 56, 5424–5433. https://doi.org/10.3758/s13428-023-02277-0

Scheel, A. M., Schijen, M. R. M. J., & Lakens, D. (2021). An excess of positive results: Comparing the standard psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science, 4(2). https://doi.org/10.1177/25152459211007467

Klein, R. A., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225

Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex. Cortex, 49(3), 609–610. https://doi.org/10.1016/j.cortex.2012.12.016

Center for Open Science. (n.d.). Registered Reports. https://www.cos.io/initiatives/registered-reports

Center for Open Science. (n.d.). Open Science Badges. https://www.cos.io/initiatives/badges

Ingrid Dahl

I work in psychology and cultural behavior, mostly helping people understand why humans make irrational decisions with complete confidence. I enjoy decoding social dynamics almost as much as quietly participating in them.