Replication-Crisis-Era Psychology Papers: How AI Should Not Fill Your Statistics Gap

Scholia · 12 min read
Printed psychology paper with p-values circled in red atop a Bayesian stats textbook and handwritten critique notes

The textbook version of how to read a psychology paper goes like this: read the Abstract, check whether p < .05, note the effect, move on. YouTube summaries of landmark studies — Milgram, Zimbardo, ego depletion, power poses — follow the same shape. The finding is the headline; the statistics are the proof; the proof is the asterisk. What this misses is that reading a psychology paper through the replication crisis, with or without AI assistance, requires treating the asterisk not as a conclusion but as a question. The asterisk says: something happened in this sample. It does not say what, how large, or whether it would happen again. The entire architecture of the replication crisis — the failed replications of ego depletion, social priming, the facial feedback hypothesis — lives in the gap between those two things.

The Sentence That Looks Like a Finding

"Participants in the power pose condition reported significantly higher feelings of power (p = .047, d = .35)."

That sentence, or one structurally identical to it, appears in hundreds of psychology papers from the 2000s and early 2010s. The surface reading is obvious: the effect is significant, therefore real. The misreading is not stupid — it is the reading the paper's own Abstract invites. The Abstract is the author's best argument, compressed to 250 words, with the p-value doing the work of a period at the end of a sentence.

The move to make here is not to distrust the number but to ask what the number is actually measuring. A p-value (p-Wert in German statistical literature, valeur p in French) is a conditional probability: the probability of observing a result at least this extreme if the null hypothesis were true. It is not the probability that the null hypothesis is true. It is not the probability that the finding will replicate. It is not a measure of effect size. These are not subtle distinctions — they are the distinctions that the replication crisis made impossible to ignore once the Open Science Collaboration's 2015 large-scale replication project found that fewer than half of a sample of published psychology findings reproduced at the same effect size.
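
If the conditional nature of the p-value feels abstract, a small simulation makes it concrete. The sketch below (Python with numpy and scipy; the group size of 21 is an assumption chosen to echo a 42-person study, not a figure from any particular paper) generates thousands of experiments in which the null hypothesis is true by construction. About 5% of them still cross p < .05, which is exactly what the threshold is designed to permit: the p-value describes what happens when nothing is there, and says nothing about whether something is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 10_000, 21  # assumed group size, echoing a 42-person study

p_values = []
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: the null hypothesis is true by construction.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
# Roughly 5% of experiments cross p < .05 even though no effect exists anywhere.
print(f"share of 'significant' results under a true null: {np.mean(p_values < 0.05):.3f}")
```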

The alternate reading — that p < .05 is a reasonable threshold for belief — is not wrong in isolation. It is wrong as a reading practice because it treats the p-value as the end of the interpretive chain rather than the beginning. The sentence above contains a second number, d = .35, that most textbook readers skip entirely. Cohen's d is a standardised effect size: it tells you how many standard deviations separate the two groups. A d of .35 is, by Cohen's own benchmarks, a small-to-medium effect. In a sample of 42 participants — the original Carney, Cuddy, and Yap study used 42 — a small effect and a p-value just below .05 are almost arithmetically guaranteed to be unstable. The confidence interval around d = .35 in a sample that size is wide enough to include zero.
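
You can check that instability yourself. A rough sketch, assuming the 42 participants split evenly into two groups of 21 (the split is an assumption, not something the sentence above reports) and using a standard large-sample approximation for the variance of Cohen's d:

```python
import numpy as np
from scipy import stats

# Illustrative numbers from the example sentence: d = .35 in a 42-person study,
# assumed here to be split 21 per group.
d, n1, n2 = 0.35, 21, 21

# Standard large-sample approximation for the standard error of Cohen's d;
# a sketch, not the exact method any particular paper used.
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
z = stats.norm.ppf(0.975)
lower, upper = d - z * se_d, d + z * se_d
print(f"95% CI for d = {d}: [{lower:.2f}, {upper:.2f}]")  # roughly [-0.26, 0.96]
```

The interval runs from roughly -0.26 to 0.96: consistent with no effect at all, and equally consistent with an effect nearly three times the one reported.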

What the Methods Section Is Actually Doing

The Methods section of a psychology paper is where the author's generative logic is most exposed and least read. Most readers — including many trained researchers — move through it quickly, checking that the sample size is listed and the measures are named, then returning to the Results. This is the reading practice the replication crisis most directly indicts.

A printed psychology journal article spread open on a desk, the Methods section heavily annotated in pencil, with a ruler marking a specific paragraph, no faces visible

Consider the structural move the Methods section is making. Every psychology paper contains an implicit argument of the form: this operationalisation of this construct, in this population, under these conditions, produces this effect. The Abstract reports the effect. The Methods section reports the operationalisation, the population, and the conditions — the three variables that determine whether the effect is a finding about human psychology or a finding about undergraduate students at a single American university completing a task they were paid $10 to perform on a Tuesday afternoon.

"Participants were 42 undergraduate students recruited from the university subject pool. All participants completed the study individually in a private room." (Carney, Cuddy, and Yap, Psychological Science, 2010)

That sentence is doing more work than it appears to. "University subject pool" is a technical term for a convenience sample — students who sign up for studies to fulfil a course requirement. The external validity question — whether the effect generalises beyond this population — is not answered in the paper. It is not asked. The paper's argument is structured so that the question does not arise until a reader brings it from outside. This is not fraud; it is the standard operating procedure of a field that had not yet been forced to confront what "generalises" means.

The reading move here is to treat the Methods section as a set of boundary conditions on the claim. Every sentence in Methods is implicitly prefixed: this finding holds when… When you read it that way, the finding in the Abstract shrinks from a claim about human psychology to a claim about a specific experimental setup. That shrinkage is not nihilism — it is precision.

Reading the Results Section Against the Grain

The Results section of a replication-crisis-era psychology paper is where the rhetorical pressure is highest. The author has a finding; the finding needs to survive peer review; the language of the Results section is calibrated to make the finding look as robust as the statistics will allow.

"The effect of condition on feelings of power was significant, F(1, 40) = 4.28, p = .045, η² = .097."

The word "significant" here is doing double duty. In ordinary English, "significant" means important, large, worth attending to. In statistical language, "significant" means only that the result crossed a pre-set threshold for rejecting the null hypothesis. The conflation of these two senses is not accidental — it is the rhetorical engine that made the pre-crisis literature so persuasive to non-specialist readers, journalists, and TED Talk audiences. A "significant" finding sounds like a finding that matters. It is a finding that passed a filter designed to exclude chance, nothing more.
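
The numbers in that sentence are tied together by simple arithmetic, and it is worth knowing how to check them. A quick sketch in Python with scipy, using only the values quoted above, recovers both the p-value and η² from the F statistic and its degrees of freedom:

```python
from scipy import stats

# Values from the quoted Results sentence: F(1, 40) = 4.28, p = .045, eta-squared = .097.
F, df1, df2 = 4.28, 1, 40

# The p-value is the tail probability of the F distribution beyond the observed statistic.
p = stats.f.sf(F, df1, df2)

# Eta-squared for a one-way design: the proportion of total variance attributable to condition.
eta_sq = (F * df1) / (F * df1 + df2)

print(f"p = {p:.3f}")           # ~0.045
print(f"eta^2 = {eta_sq:.3f}")  # ~0.097
```

An η² of .097 means the condition accounts for roughly 10% of the variance in the outcome within this sample, which is a different and more modest statement than "significant".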

The alternate reading — that "significant" is just a technical term and readers should know that — fails because the paper itself does not maintain the distinction. The Discussion section, which follows Results, almost always moves from "the effect was significant" to "this suggests that humans respond to embodied cues of power" without marking the inferential leap. The leap is from a sample statistic to a claim about human psychology. That leap requires effect size, replication, and theoretical coherence. The p-value alone does not carry it.

Reading psychology research with this in mind means treating the Discussion section as argument, not as conclusion. The Discussion is where the author's interpretation of their own data lives — and where the most important claims are made with the least statistical support. "These findings suggest" is the phrase that marks the transition from data to interpretation. Every sentence after it is a hypothesis, not a finding.

The Pre-Registration Gap and What It Means for Reading

Pre-registration — the practice of publicly committing to a hypothesis, sample size, and analysis plan before collecting data — became a standard reform recommendation after 2011. Its relevance to reading is underappreciated. A pre-registered study and a non-pre-registered study can produce identical-looking papers. The difference is invisible in the text unless the reader knows to look for it.

The place to look is the Methods section, in the subsection on analysis plan, or in a footnote citing an OSF (Open Science Framework) registration number. If neither exists, the study was not pre-registered. That does not make it wrong — most of the foundational literature in psychology predates pre-registration as a norm — but it means the analysis plan could have been adjusted after the data were collected. This practice, known as p-hacking or researcher degrees of freedom, is not always intentional. Researchers make dozens of small decisions during analysis — which outliers to exclude, which covariates to include, whether to collapse conditions — and each decision can nudge a p-value across the .05 threshold. Simmons, Nelson, and Simonsohn demonstrated in 2011 that these degrees of freedom alone could produce a p < .05 result for a finding that was entirely false.
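
A minimal simulation shows how cheaply these degrees of freedom buy significance. The sketch below is in the spirit of Simmons, Nelson, and Simonsohn's demonstration rather than a reproduction of their exact scenarios: two correlated outcome measures, no true effect anywhere, and a researcher who reports whichever of three reasonable-looking analyses happens to cross the threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group = 10_000, 20
false_positives = 0

for _ in range(n_sims):
    # Two correlated outcome measures per participant; no true group difference anywhere.
    cov = [[1.0, 0.5], [0.5, 1.0]]
    group_a = rng.multivariate_normal([0, 0], cov, size=n_per_group)
    group_b = rng.multivariate_normal([0, 0], cov, size=n_per_group)

    # "Researcher degrees of freedom": report whichever analysis crosses p < .05 --
    # measure 1 alone, measure 2 alone, or the average of the two.
    candidates = [
        stats.ttest_ind(group_a[:, 0], group_b[:, 0]).pvalue,
        stats.ttest_ind(group_a[:, 1], group_b[:, 1]).pvalue,
        stats.ttest_ind(group_a.mean(axis=1), group_b.mean(axis=1)).pvalue,
    ]
    if min(candidates) < 0.05:
        false_positives += 1

# The nominal 5% false-positive rate inflates once the analysis is chosen after seeing the data.
print(f"false-positive rate with flexible analysis: {false_positives / n_sims:.3f}")
```

Three analysis choices alone push the false-positive rate to roughly 8–10%; add optional stopping, outlier rules, and covariates and it climbs further.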

Reading a psychology paper from before 2015 without asking whether the analysis plan was fixed in advance is like reading a contract without checking whether the terms were written before or after the dispute arose. The paper may be entirely honest. But the reader who does not ask the question is not reading the paper — they are reading the author's best argument for their own findings.

Reading Psychology Papers in the Replication Crisis: What AI Gets Wrong

The fluency illusion is the cognitive-science term for the mistake of treating smooth, easy processing as evidence of understanding. It is worst for the best readers — the more fluent you are, the more a well-written summary feels like comprehension. Summarize-first AI tools — the chat-with-PDF category, the tools that compress a paper into bullet points before you have read a word — are fluency-illusion machines applied to scientific literature. A smooth AI summary of a psychology paper's results section will tell you the finding, the p-value, and the effect size. It will not tell you that the sample was 42 undergraduates, that the analysis was not pre-registered, that the confidence interval around the effect size includes zero, or that the finding failed to replicate in a sample ten times larger. It cannot tell you those things because it is optimised to compress, not to read.

Scholia's three-pillar frame — Skeleton, Environment, Soul — is worth trying on the next Methods section you read, even without the product. The Skeleton is the operationalisation: what exactly was measured, in whom, under what conditions. The Environment is the methodological moment: what norms governed this field when this paper was written, and which of those norms have since been revised. The Soul is the meta-problem: what question was this researcher trying to answer, and does the design actually answer it? Running those three questions against a paper is the opposite of what a summarize-first AI tool does. It is reading psychology research rather than consuming a digest of it.


Frequently Asked Questions

How do I read a psychology paper during the replication crisis without being misled by p-values?

Start with the Methods section, not the Abstract. The Abstract is the author's best argument; the Methods section is where the argument's constraints live. Check the sample size first — a p-value just below .05 in a sample of 40 is arithmetically fragile. Then find the effect size: Cohen's d, η², or r. A small effect in a small sample with a borderline p-value is the signature of a finding that may not survive replication. Finally, check whether the study was pre-registered — an OSF registration number in the Methods section is the clearest signal that the analysis plan was fixed before the data were collected.
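
To see why a sample of 40 is fragile for an effect like d = .35, a rough power calculation helps. The sketch below uses the standard normal-approximation formula, not the exact power analysis any particular paper would report:

```python
from scipy import stats

# How many participants per group are needed to detect d = .35
# at alpha = .05 (two-tailed) with 80% power? Normal-approximation sketch.
d, alpha, power = 0.35, 0.05, 0.80
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_power = stats.norm.ppf(power)
n_per_group = 2 * ((z_alpha + z_power) / d) ** 2
print(f"~{n_per_group:.0f} participants per group")  # roughly 128 per group
```

Roughly 128 participants per group are needed for 80% power to detect d = .35; a study with 21 per group that nonetheless found p < .05 either got lucky or is reporting an inflated effect.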

What does p-value actually mean in a psychology paper?

A p-value is a conditional probability: the probability of observing a result at least this extreme if the null hypothesis were true. It is not the probability that the null hypothesis is false. It is not the probability that the finding will replicate. It is not a measure of how large or important the effect is. The conflation of "statistically significant" with "important" or "real" is the single most consequential misreading in the pre-crisis literature.

What is the replication crisis in psychology and how does it affect reading research?

The replication crisis refers to the accumulating evidence, crystallised by the Open Science Collaboration's 2015 large-scale replication project, that a substantial proportion of published psychology findings failed to reproduce at the same effect size in independent samples. For readers, it means that a published finding is a hypothesis with supporting evidence, not a settled fact — and that the quality of the evidence depends on details buried in the Methods section, not announced in the Abstract.

Can AI help with reading psychology papers through the replication crisis?

AI that summarizes a paper before you read it risks producing the fluency illusion: a smooth, confident-sounding output that feels like comprehension but has already made the interpretive decisions for you. If you are reading psychology research and want to use an AI co-reader, the question to ask is whether it lands on the specific sentence you are stuck on — the one where the statistics are doing rhetorical work — or whether it compresses the paper into a digest. Those are opposite postures.

What is pre-registration and why does it matter when reading a psychology paper?

Pre-registration means the researcher publicly committed to a hypothesis, sample size, and analysis plan before collecting data, typically via the Open Science Framework. Without it, the analysis plan may have been adjusted after the data were collected — a practice that can produce p < .05 results for effects that are not real. When reading a paper from before 2015, the absence of a registration number is not evidence of fraud, but it is a reason to weight the finding more lightly until it has been independently replicated.

What should I read in a psychology paper's Methods section to assess replication risk?

Four things: the sample size and population (undergraduates at one university are not "people"), the operationalisation of the key construct (how exactly was "power" or "ego depletion" measured?), whether an OSF registration number is cited, and how the analysis handled outliers and covariates. Each of these is a boundary condition on the claim in the Abstract. Reading them as boundary conditions rather than procedural boilerplate is the difference between reading the paper and reading the author's summary of the paper.


Stuck on the passage?

Scholia walks one passage at a time with the full-book context of the edition you uploaded. Open the PDF or EPUB you're reading at scholiaai.com and we'll land on the exact line you tripped on — then lift to mechanism.

The AI Co-Reader for Philosophy

Scholia loads your full edition first, then walks one passage at a time.

It's the structural opposite of a summariser — LAND before LIFT, with the whole book in view. Not a database, not a translation, not a chat-with-PDF that forgot the argument by page 40.
