How to Improve Both Scientific Innovation and Reproducibility at Once

Nov 04, 2024

This is a rerun of a piece I wrote for the journal Works in Progress four years ago.

Science has two stark problems: replication and innovation. Many scientific findings aren’t reproducible. That is to say, you can’t be sure that another study or experiment on the same question would get similar results. At the same time, the pace of scientific innovation could be slowing down.

Does attempting to solve one problem make the other worse? Many have argued that policies seeking to avoid reproducibility issues will create a constrictive atmosphere that inhibits innovation and discovery.

Indeed, top policymakers are worried about just this. Along with other prominent philanthropists and academics, I attended a White House meeting on scientific reproducibility early in 2020 (just before COVID-19 really hit). One of the key questions on a sheet of paper that the White House Office of Science and Technology Policy circulated for discussion was whether a tradeoff existed: Would efforts to improve reproducibility risk harming the creativity and innovation of federally-funded research?

I do not think there’s a contradiction between reproducibility and innovation. Contrary to common belief, we can improve both at once – by incentivizing failed results, and by funding “Red Teams” that would aim to refute existing dogma or would be entirely outside it.

First, though, let’s take a step back, and briefly review the evidence that significant areas of science could be more reproducible and innovative.

Is science reproducible?

Many people have written about scientific irreproducibility over the past several decades. But the issue became more prominent in the mid-2000s with the publication of what soon became one of the most downloaded research papers of all time: The 2005 piece “Why Most Published Research Findings Are False,” by Stanford’s John Ioannidis. (Disclaimer: he is a long-time grantee of Arnold Ventures, where I work.)

To be sure, Ioannidis’s finding was mostly theoretical; it’s not as if he actually redid “most” published research (i.e., tens of millions of studies). Instead, he showed that given the way most studies are carried out, if journals have even a slight bias towards positive results (and they most definitely do), then most of the results that end up getting published would inevitably be statistical flukes or the results of p-hacking.
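As a rough illustration of that logic, the sketch below computes the share of "positive" findings that reflect a real effect, using a standard positive-predictive-value calculation. The input numbers (a 10% chance that a tested hypothesis is true, 50% statistical power, a 5% significance threshold) and the `bias` parameter are illustrative assumptions chosen for this example, not figures taken from Ioannidis's paper, and the function names are hypothetical.

```python
def ppv(prior: float, power: float, alpha: float = 0.05) -> float:
    """Share of statistically significant findings that reflect a true effect.

    prior: assumed probability that a tested hypothesis is actually true
    power: probability a study detects a true effect
    alpha: probability a study "detects" an effect that is not there
    """
    true_pos = power * prior
    false_pos = alpha * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def ppv_with_bias(prior: float, power: float, alpha: float = 0.05,
                  bias: float = 0.0) -> float:
    """Same calculation, but a fraction `bias` of results that would have
    been null end up reported as positive (selective reporting, p-hacking).
    The bias model here is a simplifying assumption for illustration."""
    true_pos = (power + bias * (1.0 - power)) * prior
    false_pos = (alpha + bias * (1.0 - alpha)) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    print(f"no bias:   {ppv(0.10, 0.50):.2f}")
    print(f"20% bias:  {ppv_with_bias(0.10, 0.50, bias=0.20):.2f}")
```

Under these assumed inputs, a little over half of positive findings are true when there is no bias at all, and the share falls to roughly one in five once a fifth of would-be null results get reported as positive. The exact numbers depend entirely on the assumptions; the point is only how quickly the share of true findings drops.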

His theoretical case has been confirmed by many empirical studies in fields from drug development to psychology. Pharmaceutical companies such as Amgen and Bayer have reported that they are unable to reproduce 80+% of experiments from prestigious journals. To quote Bayer’s scientists, “projects that were started in our company based on exciting published data have often resulted in disillusionment when key data could not be reproduced.”

Then there was the Reproducibility Project in Psychology, which we funded, and which was carried out by our grantee Center for Open Science. That project organized well over 200 psychology labs around the world to systematically redo 100 experiments published in top psychology journals. It found that only about 40% could be reliably replicated (another 40% were inconclusive, and around 20% were decisively not replicated). Since those results were published in 2015, the study has already been cited over 4,400 times according to Google Scholar. Many of the most famous results in psychology have turned out to be unreliable and possibly fraudulent (such as Zimbardo’s Stanford prison experiment), and the best recent treatment of this issue is Stuart Ritchie’s 2020 book “Science Fictions.”

To be sure, the problem seems much less acute in harder sciences – e.g., physics, chemistry, cosmology – that have an established tradition of skepticism, replication, or even blinding researchers to their own conclusions. The bulk of the reproducibility and publication bias problem seems to be in social science and biomedicine. In many of those fields and subfields – such as clinical trials in medicine, high-throughput bioinformatics, neuroimaging, cognitive science, public health and epidemiological research, economics, political science, psychiatry, education, sociology, computer science, and machine learning and AI – the published literature features too many false positives as well as conclusions that may well be p-hacked. It’s enough to make folks at the White House, NIH, and NSF worried about the quality of federally-funded science.

Is science innovative enough?

At the same time, numerous observers have pointed to an entirely different problem: science has grown less innovative these days. (And even if it hasn’t, we could always benefit from faster innovation.)

In a recent piece, Patrick Collison, the founder of Stripe, and Michael Nielsen, a theoretical physicist, made the case that the rate of scientific advancement per dollar spent has been slowing down in recent years. Based on surveys of noted leaders in physics, chemistry, and medicine, they concluded, “Over the past century, we’ve vastly increased the time and money invested in science, but in scientists’ own judgement, we’re producing the most important breakthroughs at a near-constant rate. On a per-dollar or per-person basis, this suggests that science is becoming far less efficient.”

Collison and Nielsen are far from alone. Cowen and Southwood argue that “there is good and also wide-ranging evidence that the rate of scientific progress has indeed slowed down.” The 2019 paper, “Are Good Ideas Getting Harder to Find?” argues that in semiconductors, agriculture, and medical innovations, “research effort is rising substantially while research productivity is declining sharply.”¹ That paper concludes by predicting that “just to sustain constant growth in GDP per person, the U.S. must double the amount of research effort searching for new ideas every 13 years to offset the increased difficulty of finding new ideas.”
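To get a feel for what "double every 13 years" implies, here is a small back-of-the-envelope sketch. Only the 13-year doubling time comes from the quoted paper; the function name and the time horizons in the usage note are assumptions picked for illustration.

```python
DOUBLING_YEARS = 13  # doubling time quoted from the paper above

# Constant annual growth rate that doubles research effort every 13 years.
annual_growth = 2 ** (1 / DOUBLING_YEARS) - 1


def effort_multiple(years: float, doubling_years: float = DOUBLING_YEARS) -> float:
    """How many times larger research effort must be after `years`,
    assuming the doubling time stays fixed (a simplifying assumption)."""
    return 2 ** (years / doubling_years)


if __name__ == "__main__":
    print(f"implied annual growth: {annual_growth:.1%}")
    print(f"multiple after 26 years: {effort_multiple(26):.1f}x")
```

That works out to research effort growing by roughly 5.5% every year. If the doubling time held constant for a full century, effort would have to grow on the order of 200-fold.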

Of course, some of these assessments might be too pessimistic. But it is depressingly common to hear the world’s most innovative scientists lament that they would never have succeeded in today’s academic or funding system because their work was too outside the box:

Roger Kornberg (a Nobel-winning biochemist) told the Washington Post in 2007 that his 1970s research on DNA “would never have gotten the necessary funding” if he had come along in the 2000s: “In the present climate especially, the funding decisions are ultraconservative. If the work that you propose to do isn’t virtually certain of success, then it won’t be funded.”
As reported in 2013, “UC Berkeley molecular biologist Randy Schekman won the Nobel Prize for Medicine with two other scientists this week. But he says the kind of basic science research that led to his prize might have never gotten funded if he were applying for grants today.”
David Deutsch, who pioneered quantum computing, says that he would never have gotten his “first research grant on quantum computers . . . under today’s criteria.”
Peter Higgs, the Nobel Laureate for whom the Higgs Boson is named, “believes no university would employ him in today’s academic system because he would not be considered ‘productive’ enough. . . . ‘Today I wouldn’t get an academic job. It’s as simple as that. I don’t think I would be regarded as productive enough.’”

When so many top scientists say that their own work would never have passed muster in the current system, we must take stock of the current system. As prominent scientists have asked, “How successful would Silicon Valley be if nearly 99% of all investments were awarded to scientists and engineers aged 36 years or older, along with a strong bias toward funding only safe, non-risky projects?” Moreover, a common complaint is that “scientists are forced to specify years in advance what they intend to do, and spend their time continually applying for very short, small grants” – hardly a system that would encourage innovation.

In short, we have evidence that US science funding is often fairly tame and incremental, that some of the most innovative science of the past would never have been funded by today’s bureaucracy, and that scientific review panels are dominated by insiders.

Thus, innovation in science is imperiled. If Einstein had to navigate such a system, we might never have heard of relativity. And even if innovation weren’t slowing down per se, we could still do better.

What next?

There are lots of ideas about how to improve scientific reproducibility through the way federal research is funded. After all, quality control and assurance are hardly new ideas.

For example, we could require that data and computer code be shared openly so that others can scrutinize and rerun it. In too many cases to list, this sort of reanalysis has led to revisions, retractions, and even the discovery of outright fraud.

Next, we could require that experiments and other empirical studies be pre-registered, so that the analysis and results are less likely to be cherry-picked later. We already do this for clinical trials in medicine, and a review of federally-sponsored clinical trials found that the rate of positive results went down dramatically as soon as researchers were required to pre-register their studies. We could do the same for much else in science. We could even move towards more widespread use of the Registered Reports format, in which journals accept an article for publication before the final results are even available.

It’s less obvious how to reform government funding so as to improve scientific innovation. Let’s try a thought experiment:

Imagine that you were the President 100 years ago, instead of Woodrow Wilson. Imagine that a time-traveling genie from the future tells you that over the next hundred years, there will be an astonishing number of inventions and scientific discoveries – treatments for diabetes and simple infections, vaccinations for diseases that currently kill or disable many people, automobiles that will be used by the millions, machines that will fly across the ocean and even to other planets, television, submarines, computing machines, handheld telephones, nuclear energy, satellites that will orbit the earth, genetics, and much, much more.

You then say to yourself, “This is all well and good, but I’ll be long dead in 100 years. If all of this scientific advancement is going to happen, I want to find a way to speed it up.”

Now in the year 1920, significant science funding didn’t yet exist. Today, of course, we have the National Institutes of Health (NIH) and the National Science Foundation (NSF), which are collectively funded at some $45 billion a year. But those agencies wouldn’t exist until 1930 and 1950, respectively.

So, as President in 1920, you decide to create governmental scientific funding. How would you do so such that, over the next hundred years, the average scientific discovery or invention will occur a mere five years earlier than it would have otherwise? If that’s too hard, how would you make just one scientific discovery occur five years earlier?

Even with the benefit of hindsight, this might seem a difficult question. Some of the most well-known scientific discoveries were serendipitous: Alexander Fleming’s discovery of penicillin; Wilhelm Roentgen’s discovery of X-rays; Archimedes’ bath in which he realized how to measure the volume of irregularly-shaped objects.

It’s hard to predict serendipity. And serendipitous or not, you can’t fully anticipate a future scientific discovery, or else you would have already made that discovery right now.

But can we at least create the conditions in which scientific discoveries will occur more frequently? Better yet, can we do so while still improving scientific reproducibility?

Possible but unlikely solutions

One of the most common ideas is to “fund the person, not the project.” In other words, scientific innovation thrives when the best scientists have the freedom to follow their instincts, without being tied down to a particular proposal designed to satisfy an external bureaucracy. Thus, if you want to fund the most innovative science, you should look for the best people and then give them several years of funding to do what they want.

This idea makes some sense. One famous paper argues that the Howard Hughes Medical Institute successfully uses this model to support more innovative biomedical research than the NIH does, while another paper argues that a small NIH program along the same lines was a success. And Alan Kay, the eminent computer scientist, has written that the original funding that developed the Internet was based on two principles: “visions rather than goals,” and “fund[ing] people, not projects.”

While there’s a place for “funding people over projects,” it is unlikely to work for scientific funding at scale. I worry that handing out $40+ billion a year that way could create more groupthink than ever seen before. Younger scientists would need to play an extreme version of office politics in order to be seen as one of the promising “people” who get funding.

Others have suggested that we look to the wisdom of the crowds, by giving a broad spectrum of scientists the ability to allocate some funding to other scientists that they think are particularly promising. Indeed, the Dutch government is piloting such an approach. But it’s hard to see how this approach would avoid turning into a popularity contest that improves neither innovation nor reproducibility.

Still others have argued that since prior scientific discoveries have been so unpredictable, and since there is little to no evidence that peer review works as deployed by the NIH and others, we should just admit that we don’t know what we’re doing, and expressly leave it up to chance. That is, scientific research proposals that pass a fairly low threshold of quality should be entered into a lottery to determine which ones get funded. Indeed, major funding agencies in New Zealand and Germany have been experimenting with lottery-based funding for at least some grants.

Again, while there’s a place for this idea, it’s hard to see why it would work for more than a handful of grants. Scientists need at least the possibility for stable and continued funding over a long period of time. Hardly anyone would go into science if their entire career depended on a repeated lottery with a small chance of winning, rather than on their own effort at doing good science.

But as soon as you allow prior lottery winners to renew grants based on their scientific progress, you’re back to square one: how do you best determine scientific progress? It’s a bit nihilistic to think that we can do no better than a coin flip on that question.
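For readers who want to see the lottery mechanism spelled out, here is a minimal sketch of the two-step process described above: screen proposals against a low quality bar, then draw winners at random. The `Proposal` type, the 0-10 scoring scale, and the function name are all hypothetical, and this is not a description of how the New Zealand or German programs actually run their lotteries.

```python
import random
from typing import List, NamedTuple, Optional


class Proposal(NamedTuple):
    title: str
    score: float  # reviewer quality score on an assumed 0-10 scale


def funding_lottery(proposals: List[Proposal], threshold: float,
                    n_awards: int, seed: Optional[int] = None) -> List[Proposal]:
    """Keep proposals scoring at or above `threshold`, then pick up to
    `n_awards` of them uniformly at random (score no longer matters)."""
    rng = random.Random(seed)  # a seed makes the draw auditable and repeatable
    eligible = [p for p in proposals if p.score >= threshold]
    k = min(n_awards, len(eligible))
    return rng.sample(eligible, k)


if __name__ == "__main__":
    pool = [Proposal("A", 8.0), Proposal("B", 3.0),
            Proposal("C", 6.5), Proposal("D", 7.0)]
    for winner in funding_lottery(pool, threshold=6.0, n_awards=2, seed=1):
        print(winner.title)
```

Note what the sketch leaves out: nothing in it says how a winner's grant would be renewed, which is exactly the question the lottery cannot answer on its own.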

A side note — while I voice some skepticism above about how some funding mechanisms might work, I enthusiastically support the idea that large funders (e.g., NIH) should do one or more randomized experiments in which millions or even billions of dollars are allocated in different ways so as to test the results. It makes no sense to demand more rigor and evidence from every $100k research project than from our entire system for allocating $40+ billion in funding.

My proposed solutions

There are two ideas that could increase both reproducibility and innovation, thus killing two birds with one stone (actually the same two birds with each of two stones). First, we need to demand more null results from all the science we fund. Second, we need to “Red Team” all of science. Let’s dig in.

Demanding null results

We are all biased towards positive and exciting results. This is understandable: A drug that cures cancer is more exciting than a drug that doesn’t. An education intervention that reduces high school dropout by 50% is more exciting than one that does nothing. A technique for improving marital happiness is more exciting than one that leaves everyone as unhappy as before. This is all reminiscent of how we are biased towards high-calorie foods (almost all addictive foods – such as potato chips, ice cream, doughnuts, French fries, etc. – combine high fat and high carbohydrates).

But just as a bias towards high-calorie foods messes up our eating habits now that such foods are available 24/7, a bias towards positive results distorts the entire scientific process now that science has become a major industry. Reviews of scientific literature typically find that across all the major research fields, the published results are 70% to 90+% positive.

That’s a huge problem! There are only three ways for a scientist to guarantee positive results:

Be a psychic;
Study only marginal, incremental topics where the path forward is clear, and you can virtually guarantee a positive result; and/or
Skew your research design, data, and analysis, and hide any results that are still null.

Let’s rule out the possibility that a majority of researchers are psychics. The other two methods of getting all-positive results are a threat to innovation and/or reproducibility.

In science, just as in everything else (finance, etc.), there is a risk-reward tradeoff. Low-risk projects come with low rewards. High-reward projects are more risky and likely to fail. Sadly, we don’t live in a universe where it is generally possible to engage in activities that are both low-risk and high-reward.

We need to stop acting as if science can evade this inevitable risk-reward tradeoff by delivering results that are groundbreaking yet predictably successful. Nobel winner William Kaelin wrote earlier this year, “Today, federal research funding is increasingly linked to potential impact, or deliverables, and basic scientists are increasingly asked to certify what they would be doing with their third, fourth and fifth years of funding, as though the outcomes of their experiments were already knowable.”

What do you get if you demand substantial impact from projects that are predictable several years out? The worst of all worlds: low-risk, marginal projects dressed up as if they had high impact. In other words, science that isn’t very innovative, yet that is described with flashy, irreproducible claims.

We need to start demanding null results. Each federal agency should reorient its peer review and grant renewal processes to require that a certain percentage of research projects will “fail” or produce null results. (We the public could also stop showering acclaim, TED talks, etc., on scientists with glamorous results.)

A clear expectation that most research projects will fail or produce null results would empower scientists both to take creative risks (rather than studying incremental topics), and to avoid p-hacking by telling the full truth about their research (however messy or null).

Conversely, if too many research projects turn up with positive results, that should be seen as a cause for investigation, not celebration. Some of the most famous examples of fraud – the psychologist Diederik Stapel, for example – were well-known for always producing impressive, positive results.

What should the proper rate of null results be? In cases where we know the full body of studies on a given issue, it’s typical for up to 90% of them to have null results. For example, out of 90 education interventions evaluated by federally-funded RCTs, only about 10% had positive results.

At the other end of the spectrum, consider Phase III clinical trials (the final stage before FDA approval). A comprehensive paper shows that only about 59% of Phase III trials succeed.

This is the maximum rate of positive results one ought to see. After all, by the time of a Phase III clinical trial submitted to the FDA, a pharmaceutical company may have spent several years and a billion or more dollars on lab tests, extensive animal testing, and the earlier stage human trials. Even with all of that evidence that a drug will work, the most rigorous trials still fail 40+% of the time. In almost all other areas of research, no one will have spent many years and billions of dollars trying to guarantee that the effect in question will be replicable.

In short, federal funding agencies should emphatically stop expecting prior results that essentially guarantee future success. Future success cannot be guaranteed without studying incremental topics and/or rigging the science. We need to demand a certain percentage of null results, and even investigate any scientist or federal funding agency whose results are too uniformly positive.
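One way to make "too uniformly positive" concrete is a simple binomial check, sketched below. It asks: if every study in a portfolio had the 59% success rate cited above for Phase III trials, how likely would it be to see at least this many positive results? The independence assumption, the 1% cutoff, and the function names are assumptions made for this illustration, not part of any agency's actual practice, and a flag would only be a prompt for a closer look rather than evidence of wrongdoing.

```python
from math import comb

MAX_PLAUSIBLE_SUCCESS = 0.59  # Phase III success rate cited above


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more positive
    results out of n independent studies that each succeed with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def looks_too_positive(positives: int, total: int,
                       ceiling: float = MAX_PLAUSIBLE_SUCCESS,
                       cutoff: float = 0.01) -> bool:
    """Flag a portfolio whose positive count would be very unlikely even
    under the generous `ceiling` success rate. `cutoff` is an arbitrary
    illustrative threshold."""
    return prob_at_least(positives, total, ceiling) < cutoff


if __name__ == "__main__":
    print(looks_too_positive(19, 20))  # 19 of 20 studies positive
    print(looks_too_positive(12, 20))  # 12 of 20 studies positive
```

Under these assumptions, a record of 19 positive results out of 20 gets flagged, while 12 out of 20 does not.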

Red team all of science

As humans, we are prone to groupthink. Like our bias for positive results, the bias towards groupthink is understandable. The world is full of more information than any single person can possibly comprehend. It makes sense that when we ask, “What is sensible to believe?,” we would usually take our cues from what everyone else believes at the time.

This sort of groupthink hurts reproducibility because any scientist who comes up with results that aren’t consistent with current groupthink is incentivized to hide the results, redo the experiment, distort the results until everyone thinks they are similar to the consensus view, or just change his or her approach altogether so it fits within what is currently considered fundable. My friend Saul Perlmutter has written about how groupthink has even affected estimates of seemingly objective measures like the charge of an electron or the lifetime of a neutron.

Groupthink hurts innovation as well. It isn’t an accident that so many scientific discoveries throughout history – including ideas that we now think obvious, such as the circulation of blood or the danger of germs – were disbelieved or treated as heretical at the time. As Max Planck famously quipped, science advances one funeral at a time — a quip that recently got some impressive empirical support from a study by Pierre Azoulay and colleagues (it turns out that when famous scientists die, the subfield in which they worked sees a flowering of new scholars and publications compared to other subfields).

Scientists should be free to pursue the data wherever it leads, not be held to the current consensus of peer reviewers, which can be limited or sometimes outright wrong. Consider how groupthink played out in the search for a cure for Alzheimer’s disease. As documented in a great article by Sharon Begley in STAT, “scientists whose ideas fell outside the dogma recounted how, for decades, believers in the dominant hypothesis suppressed research on alternative ideas: They influenced what studies got published in top journals, which scientists got funded, who got tenure, and who got speaking slots at reputation-buffing scientific conferences. This . . . is a big reason why there is no treatment for Alzheimer’s.” Only now that so many drugs targeting beta-amyloid have failed are scientists finally willing to consider that their theory was perhaps incomplete or even wrong.

How can we best create a space for an alternative to groupthink? By “red teaming” all of science.

Red teaming is the term that the military and intelligence communities use when they task a group of people (called the “red team”) with trying to attack and refute something like a strategy for battle or an intelligence assessment of an enemy capability. Indeed, the United States Army published a 238-page book called, “The Red Team Handbook: The Army’s Guide to Making Better Decisions.”

Since everyone is prone to groupthink and confirmation bias (often punishing anyone who goes against the consensus), we need to specifically empower some people to be an antagonist, with the explicit role of trying to refute, attack, and discredit other scientists and their theories. If they do a good job and show that the current consensus is wrong, nobody ought to be resentful – that was their direct remit, after all. As the United States Joint Chiefs of Staff said, red teams “help commanders and staffs think critically and creatively; challenge assumptions; mitigate groupthink; reduce risks by serving as a check against complacency and surprise; and increase opportunities by helping the staff see situations, problems, and potential solutions from alternative perspectives.”

Some scholars and commentators have recently recommended putting out individual papers for a “red team” review. Indeed, in one case, a team of scholars literally paid a team of five outside experts to find errors in a new manuscript — $200 flat per expert, plus an additional $100 bounty for each major error that someone found, up to a total of $3,000. Similarly, the scientist Stuart Ritchie has announced that he will pay anyone who finds an objective error in his book Science Fictions.

This is a great start, but many scholars won’t have an extra $3,000 sitting around to boost the quality of each article. And ironically, scholars who put up their own money for a “red team” might be the least likely to need it – after all, they are already motivated to do quality science. It’s the scientists who would never dream of soliciting rigorous criticism that we need to worry about.

What we need is something much broader and more systematic, with the institutional heft to red team the rest of science, as necessary. (We don’t need to red team everything – many scientific articles aren’t influential enough to be worth bothering about.)

Let’s imagine launching a new federal institute – call it the National Institute for Innovation and Replication – with its own budget and statutory authority independent from other federal agencies.

Its mission would be to provide a counterweight to the rest of biomedical and social science, in two ways:

First, it would sponsor independent replications of influential papers and projects funded elsewhere. Such projects are otherwise hard to fund, but provide an important check on the reproducibility of science.

Second, the Institute would provide numerous streams of funding on important scientific questions where the traditional sources of funding are arguably affected by groupthink and confirmation bias, or where promising lines of research aren’t politically popular at the moment. For example, such an institute would have funded scientists with new ideas as to the cause and treatment of Alzheimer’s disease. And over the past two decades, it would have provided funding for coronavirus studies, which were relatively neglected when times of crisis (e.g., MERS or SARS) had passed.

The Institute could also establish a new publication to serve as an alternative to the likes of Science and Nature. (It could even be called “Anti-Science” or “Anti-Nature”!) The journal would publish articles that specifically challenge other high-impact publications, either by replicating them or by offering alternative theories.

Improving reproducibility and innovation isn’t easy, to be sure. But science policy and science funders could do both at once by demanding more null results, and by substantially funding efforts to contradict groupthink and confirmation bias. And this would help all of society get more value out of the many billions of dollars that we collectively spend on science every year.

Josef M. Klein

Nov 4, 2024 (edited)

I really like the Red Team suggestion. It could help a lot to rile up stagnant fields that are stuck in echo chambers.

I find it very useful in my (professional) life to skip failure, reach results of higher quality or simply kill ideas of plausible misinformation much faster.

The Good Science Project
