Back and Forth on the Value of Replication
From Stuart:
My friend Jordan Dworkin recently wrote an excellent piece titled, “How Much Should We Spend on Scientific Replication?” It is the first attempt to model the probability that funding replication studies will be more impactful than just funding new scientific studies.
I saw the piece ahead of time, of course, but after doing a little more rumination, I have a couple of thoughts that I should have offered earlier.
First, Dworkin says that we should prioritize replicating new studies, based on the typical pattern of how citations accumulate:
First is when the replication happens — the attention a paper receives, and the extent to which it drives follow-on research, are time-dependent. Papers tend to accrue ~2.5% of their total citations in the first year after publication, 7.5% in the second, and ~12% each in years 3-5; by the sixth year post-publication, the average paper has already received almost half of the direct attention it ever will. Because the capacity of a failed replication to reduce a study’s impact depends on when the replication occurs, we should not prioritize replicating studies that have already accrued a lot of citations — at that point, it may be too late to capture most of the replication’s value. Instead, we should aim to replicate papers that are likely to accrue many citations in the future.
I’m not sure I agree with this assumption that the value of replication is mainly in heading off citations early on in a paper’s lifetime.
The famous 2006 Alzheimer's paper that turned out to be likely fraudulent was cited over 3,600 times, according to Google Scholar. 414 of those citations have come since 2021, starting 15 years after the paper was published (and keep in mind that mid-2022 was when the potential fraud was uncovered).
I think it's fair to say this was still an influential paper 15+ years after publication, and the full extent of its influence wasn't just in the direct citations but in all the follow-on papers and grants, many of which might not have cited the 2006 paper directly.
I think there’s something to the following stylized model of how a lot of science operates:
Paper A comes out in Year 1. It gets a fair bit of attention early on, because it does something new/surprising/useful. Papers B, C, and D then build upon it in the next year or two. By 5 or 10 years out, Papers E through Z are now highly influential papers that build on the line of research started by Paper A. At this point, direct citations to Paper A start to drop off (unless someone just wants to be historically complete), because people are citing the dozens or hundreds of more recent papers that build upon it.
In this case, I think it could still be valuable to directly replicate Paper A with as much rigor as possible. If Paper A was wrong, it is possible that hundreds of papers in the last 10 years have been chasing after the wrong idea.
If the Alzheimer’s replication had happened earlier on, sure, that would have been even better. But the mere fact that time has elapsed and direct citations have dropped doesn’t necessarily mean that the paper is any less influential. It could be quite the contrary: The paper’s seminal status is now so embedded in the field that direct citations aren’t capturing anywhere near the amount of influence it has, and direct replication would still be valuable.
Here’s another wrinkle on citation patterns, from another famous paper: Langer’s 1975 psychology paper “The Illusion of Control.” (hat tip to Science Banana for this example).
Its citations over the years have been impressive, and even grew considerably in the 4th and 5th decades after publication!
But was it a reliable paper? Well, a 2021 replication with over 10,800 participants (17 times the original sample size) found that the original result didn't stand up.
According to Google Scholar, the replication paper has only been cited 42 times in 4 years (even while the original paper still gets cited 200 times per year). Which is disheartening.
Speaking of whether people cite a failed replication or the original study: Dworkin's policy brief notes that the value of replications might depend on how much attention people pay to them.
I agree with that point to some extent, but where I differ is this: it treats the community's response to a replication study as a given fact (i.e., a "35% reduction [in citations] after a few years").
But how the community responds to replications isn't a fixed quantity that no one can change going forward.
A large-scale replication initiative (at the NIH, for example) might not just fund replication studies and then let the citations settle out wherever they may. Instead, it might also include a focused effort to:
Publicize the results of replication studies via scientific societies and their newsletters.
Update PubMed so that each study's page prominently displays the results of any replication.
Work with journals like Science and Nature to make sure that their own webpages are updated with replication results.
Announce that the NIH will sharply discount any grant application that cites the original paper but not the replication, since its authors are obviously not up to speed on their own field.
An agency like NIH has many tools to get scientists to pay more attention to replications! The amount of attention paid to replications is not a fixed (and low) quantity that we have to take as given, thereby diminishing the impact of doing any replications at all.
Anyway, these are smaller quibbles in the grand scheme of things. Jordan Dworkin’s piece is excellent, and we need more of this in science/metascience.
From Jordan in response:
First off, thanks for engaging with the piece and thinking critically about it. You’ve raised some great points. We agree on many things, but there are a few I think are worth pushing back on.
Continued-influence and time-elapsed are both important
One key point that you raise is that the temporal aspect of my model should capture something more like “continued influence” rather than a strict “time elapsed” parameter. So if a paper is 15 years old but still getting hundreds of cites per year, there could still be lots of value in replicating it.
I agree with that, and it's true that the default 15-year trajectory is a bit limiting. But the model doesn't downweight the value of replicating old papers, per se; it downweights papers whose influence is mostly in the past. An older paper that is still accumulating citations at a high rate will show high expected future impact, and thus high replication ROI. Plugging in parameters that roughly correspond to the Lesné paper — moderate skepticism (say, a 33% chance of failure) and 3,600 citations after 15 years — still shows 9x returns for a replication at year 10.
But even granting that a Lesné replication at year 10 would have been valuable, it's important not to discount timing's role here. We knew that paper was impactful after 5 years, and if you assume a 5-year replication in the model instead of 10, then the ROI is 39x; if you plug in 3 years, it's 59x.
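For readers who want to see the shape of this arithmetic, here is a minimal sketch in Python. To be clear, this is not the model itself: the citation curve, the replication cost, and the value placed on a citation are all illustrative placeholders, so the printed ROI figures will not match the numbers above; only the qualitative pattern (earlier replications yield higher returns) carries over.

```python
# Minimal sketch of the replication-ROI arithmetic (not the actual model).
# All parameters below are illustrative placeholders.

def future_citation_share(replication_year, horizon=30):
    """Share of lifetime citations arriving after the replication year,
    using the stylized accrual curve quoted earlier (~2.5% in year 1,
    ~7.5% in year 2, ~12% in years 3-5), with the remainder spread
    evenly over later years (an assumption of this sketch)."""
    early = [0.025, 0.075, 0.12, 0.12, 0.12]
    remaining = (1 - sum(early)) / (horizon - len(early))
    curve = early + [remaining] * (horizon - len(early))
    return sum(curve[replication_year:])

def replication_roi(p_fail, lifetime_citations, replication_year,
                    citation_reduction=0.35, value_per_citation=1.0,
                    replication_cost=50.0):
    """Expected value of misleading citations averted, per unit of cost."""
    averted = (p_fail * citation_reduction * lifetime_citations
               * future_citation_share(replication_year))
    return averted * value_per_citation / replication_cost

# A paper with moderate skepticism (33% chance of failure), replicated
# at different points in its citation trajectory:
for year in (3, 5, 10):
    roi = replication_roi(p_fail=0.33, lifetime_citations=3600,
                          replication_year=year)
    print(f"replication at year {year}: ROI ~ {roi:.1f}x")
```

The point of the sketch is simply that, holding everything else constant, the earlier the replication, the larger the share of the paper's future citations it can affect.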
So I agree that there are older papers that we should still replicate (and I think in many cases the model would still show those delayed replications as high-ROI). But in most of those cases we would have gotten substantially higher returns had we replicated them earlier, so finding those early opportunities should be a large focus of any new program.
Baked-in influence probably makes replications less impactful, not more
You acknowledge the timing point, but also argue, "the mere fact that time has elapsed and direct citations have dropped doesn't necessarily mean that the paper is any less influential. It could be quite the contrary: The paper's seminal status is now so embedded in the field that direct citations aren't capturing anywhere near the amount of influence it has, and direct replication would still be valuable."
On the broad point about non-citation measures of influence, I agree that citations are limited as a tool here, and think that finding ways to source scientists’ perceptions of a paper’s influence (either instead of or in addition to metrics) would be a better approach for an actual program. But I think we might disagree on the implications of accumulated influence. I’d argue that the more baked-in a paper’s influence becomes, the less impactful a replication will be. Once there are hundreds of follow-on studies and a few subfields driven by a paper, un-ringing that bell takes far more than an independent replication.
In Lesné's case, it took egregious and unambiguous fraud, layered on top of a slow accumulation of uncertainty from years of small-scale failures to replicate. A single failed replication paper in 2022 would (in my opinion) have had a small fraction of the power of the fraud investigation, partly because the broader field was so developed that people might not have actually cared that much whether any individual oligomer held up.
An earlier replication, however, might have stood on more equal footing with the original findings. I think the Langer example supports this, in a way: by 2021 hundreds of papers per year were citing Langer as a foundational paper, and the replication couldn’t move the needle. Had the replication happened in 1996, when it was clear that the influence was rising rather than falling, but the scope of influence was still manageable, the trajectory might look very different today.
So if impact compounds in ways that are invisible to citation counts, and the influence-reducing ability of any given replication decreases as a function of a paper’s current influence, then my framework might even be conservative with respect to the importance of timing.
It is definitely true that there are non-citation measures of influence (hype, funding, trials) that might lead a simple model to discount the importance of replicating an older study. And I think that reducing the influence of incorrect foundational studies can be highly valuable if achieved. But I would caution that the presence of that hidden influence might suggest that replication is no longer sufficient for achieving those ends. The Lesné fact pattern — “egregious fraud, in the most widely-discussed scientific domain, with a paper published in Nature and an exposé published in Science” — is not one that most NIH-funded replications will be able to work with.
Maybe we should replicate the descendants
For similar reasons to those above, I’m a bit torn on your stylized model. If follow-on Papers E through Z are cannibalizing citations from Paper A, I think there are cases when it would genuinely be higher impact to replicate those studies rather than Paper A; if Paper A fails to replicate, but I still trust Papers E and F, will I actually change my research?
I admit that the Lesné example cuts against this a bit. Perhaps there’s a distinction between papers whose influence operates mostly at the level of “creating hype and funding for a flawed subfield” and those whose influence operates by “producing a specific finding that lots of others try to build on.” In the former case, replicating the focal paper is probably higher value even if the follow-on papers are cannibalizing citations. In the latter, if papers E through Z are becoming more widely cited in their own right, I think replicating them could genuinely be higher impact. My guess is the latter situation is more common, but instances of the former might be particularly high leverage.
Increasing attention is critical
On your last point about not treating community response as static, we’re in agreement. The model treats the reduction in citations as a fixed parameter mostly because that’s the ecosystem that any new replication program will exist within, and we should be clear-eyed about the impact those replications will or will not have.
But you’re right that 35% is not a ceiling, and it would be very valuable for NIH to try some levers for getting scientists to pay more attention to replications. I would be excited to see, for example, an effort to more cleanly link (high quality) replications to original papers, and I think your PubMed idea is a clever and tractable one.
From Stuart in reply:
Thanks for these many thoughtful remarks, Jordan! I don’t think we’re that far apart, if at all.
I agree that even when older papers still deserve replication, it would have been more valuable to replicate them earlier rather than later.
I agree that if a paper's findings have become embedded in a literature of hundreds or thousands of follow-on papers, then even if the entire literature is spurious, merely replicating the original paper might not be anywhere near enough to overturn that literature.
I agree that replicating follow-on papers might well be higher-impact.
And we both agree that NIH and others can do much more to raise the prominence and salience of replication studies.
So, here’s to both of us raising a glass towards more replication at NIH!
Reader comments:
Hi, thank you for an interesting post! Visibility of replication studies is definitely an important topic to bring into wider awareness. We are currently working on a project at FORRT, focused on this exact problem: https://forrt.org/marco/. If you are open to it we would welcome feedback and/or to share notes/brainstorm!
A few more comments:
I think it's wrong to view replication as having a binary outcome — confirmed or disconfirmed. For one thing, that's not how statistical significance works. There are two potential aspects to replication — increasing the data, and validating the methods — and these operate somewhat independently. It's possible for a replication study to produce results that are statistically different from the original study even though it "confirms" the original findings; that is still an important observation, because it demonstrates that there are significant factors that are not adequately controlled. It's also possible for a replication study to produce results that are not statistically different from the original study (i.e., the hypothesis that the two studies had the same outcome distribution is not disconfirmed) but are still less statistically significant than the original. And it's also possible that the combined results of the original and the replication study produce more significance, or more analyzable data, than either alone.
For all of these reasons and others, I also agree with Stuart that it's wrong to look at the value of replication studies as simply "probability of failure to replicate" times "impact of failure to replicate". If a study is very influential, there could be significant value in *confirming* it, for example. And simply expanding the amount of data, thereby narrowing the confidence intervals for outcome variables, could also be worth a lot.
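To make that last point concrete, here is a minimal sketch of fixed-effect inverse-variance pooling, a standard way of combining an original study with its replication. The effect sizes and standard errors are invented for illustration and don't correspond to any study discussed above.

```python
import math

def pool(studies):
    """Fixed-effect inverse-variance pooling of (effect, standard_error) pairs."""
    weights = [1 / se ** 2 for _, se in studies]
    effect = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return effect, se

original = (0.40, 0.15)     # hypothetical original study: (effect, SE)
replication = (0.15, 0.08)  # hypothetical larger replication: (effect, SE)

effect, se = pool([original, replication])
print(f"pooled effect = {effect:.2f}, 95% CI half-width = {1.96 * se:.2f}")
# The pooled standard error (~0.07) is smaller than either study's alone,
# even though the two point estimates differ: more data, narrower interval.
```

The pooled estimate is more precise than either study on its own, which is exactly the sense in which a replication can add value even when it neither cleanly "confirms" nor "disconfirms" the original.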