We’re all familiar with meteorologists’ forecasts: a 5% chance of rain, a 95% chance of rain, etc. It would be far less useful if you turned on the Weather Channel only to hear on every occasion that there were two possible forecasts: “Rain will happen today,” or “rain will not happen today,” with the forecast being wrong half the time.
It would be only slightly more useful if the forecasters used a scale of 1 to 9: “Chance of rain today is a 7” versus “chance of rain is a 5.” What does any of that actually mean? Should you carry an umbrella or not? Turn off the yard sprinklers? Plan for extra time on the drive to a meeting?
This is not unlike the situation we’re in with proposal review at the moment.
[For the rest of this essay, I’m calling it “proposal review” to encompass both peer review at agencies like NIH and NSF, as well as the review of proposals at DARPA and ARPA-H that might not technically have to use "peer review."]
Proposal reviewers are constantly expected, even if implicitly, to make predictions about the success of a particular research proposal. But to my knowledge they are not asked to provide actual probabilities, let alone probability distributions of likely impact.
Instead, they’re asked to make judgments about whether a proposal is “good” or “bad”, or to rate proposals on a scale from 1 to 9, or something else that doesn’t actually require them to estimate and articulate probabilities of success.
As an experiment in enhancing proposal review, science funding agencies ought to try having peer reviewers or other evaluators assign explicit probabilities that a proposal will reach particular milestones or technical goals.
This idea might work better for some research areas (goal-directed projects, clinical trials, etc.) than for others (e.g., basic exploratory science), but focusing on probabilities would likely benefit all proposal review.
How might this look in practice? Reviewers would likely show some independent and diverse judgment (Lee et al., 2013). For example, Reviewer 1 gives a proposal a 50% chance of success in reaching milestone 1, while Reviewer 2 gives it 85%, and Reviewer 3 says it only has a 25% chance.
These kinds of judgments can then be scored over time. Better yet, reviewers could provide confidence intervals, e.g., “I’m 95% confident that the probability of success is between 40% and 60%.”
It might turn out that Reviewer 1 is pretty good (or is “well-calibrated” in forecasting parlance) in that the proposals that she predicts to have a 50/50 chance generally turn out to succeed (or fail) half the time. On the other hand, Reviewer 2 might turn out to be biased towards positive judgments, assigning higher probabilities to proposals that don’t succeed very often, indicating he should perhaps be more conservative in his estimates. Likewise, Reviewer 3 might be far too negative, underestimating how successful proposals turn out to be.
The point is not to expect perfect accuracy from all reviewers, although that would be nice. The point, instead, is that by requiring reviewers to make probabilistic forecasts, funding agencies can identify skilled reviewers, and reviewers can use their feedback to improve their own performance.
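To make “scoring over time” concrete, here is a minimal sketch in Python, using made-up forecasts, reviewer labels, and outcomes, of how an agency might compute each reviewer’s Brier score (the average squared gap between forecast and outcome, where lower is better and always guessing 50% earns 0.25), along with a crude over/under-confidence check:

```python
from collections import defaultdict

# Each record: (reviewer, forecast probability of reaching the milestone,
# outcome: 1 if the milestone was reached, 0 if it was not). All values invented.
forecasts = [
    ("Reviewer 1", 0.50, 1), ("Reviewer 1", 0.50, 0), ("Reviewer 1", 0.70, 1),
    ("Reviewer 2", 0.85, 0), ("Reviewer 2", 0.90, 1), ("Reviewer 2", 0.80, 0),
    ("Reviewer 3", 0.25, 1), ("Reviewer 3", 0.20, 1), ("Reviewer 3", 0.30, 0),
]

squared_errors = defaultdict(list)
gaps = defaultdict(list)
for reviewer, p, outcome in forecasts:
    squared_errors[reviewer].append((p - outcome) ** 2)  # Brier score component
    gaps[reviewer].append(p - outcome)  # positive on average = over-optimistic

for reviewer, errors in squared_errors.items():
    n = len(errors)
    print(f"{reviewer}: Brier score {sum(errors) / n:.3f}, "
          f"average forecast-minus-outcome {sum(gaps[reviewer]) / n:+.2f}")
```

In this toy data, a reviewer who habitually assigns high probabilities to projects that fail shows both a high Brier score and a positive average gap, flagging exactly the kind of optimism described above.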
Here are some of the potential benefits:
Everyone would have to be more precise in their evaluations. It would not be sufficient to say “this proposal is good, bad, or ok”; a reviewer would have to specify the actual chances that the approach would work. This would help reviewers speak a common language with one another during proposal evaluations, and it would allow an agency to see the spread of forecasts among reviewers for any given proposal.
Forecasting could improve both agency-wide and individual reviewer calibration over time through consistent feedback and review. Feedback loops are essential, after all, for improving performance (e.g., Goetz, 2011). If you’re constantly making implicitly predictive judgments, but never have to provide explicit probabilities and thus can’t receive feedback as to whether your forecasts were accurate, you will miss out on many opportunities to improve.
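As a sketch of what that feedback might look like, an agency could bin a reviewer’s past forecasts and compare the average probability given in each bin with how often those proposals actually reached their milestones (again, the data here are made up):

```python
# Hypothetical track record for one reviewer: (forecast probability, outcome).
forecasts = [
    (0.10, 0), (0.15, 0), (0.20, 1), (0.45, 0), (0.50, 1),
    (0.55, 1), (0.60, 0), (0.80, 1), (0.85, 1), (0.90, 0),
]

bins = {"low (under 33%)": (0.00, 0.33),
        "middle (33-66%)": (0.33, 0.66),
        "high (66% and up)": (0.66, 1.01)}

for label, (lo, hi) in bins.items():
    in_bin = [(p, y) for p, y in forecasts if lo <= p < hi]
    if not in_bin:
        continue
    avg_forecast = sum(p for p, _ in in_bin) / len(in_bin)
    success_rate = sum(y for _, y in in_bin) / len(in_bin)
    print(f"{label}: average forecast {avg_forecast:.0%}, "
          f"observed success rate {success_rate:.0%}")
```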
An agency would be able to visualize its overall portfolio: the proportion of its funding going to research or projects that are truly risky (perhaps under a 20% chance of success), somewhat risky (perhaps closer to 50%), and, where risk is less appropriate, closer to slam dunks. In an agency that promotes a “high-risk, high-payoff” mission, funding proposals that on average have an 80% chance of success might be overly conservative. Other funding agencies might want a different risk profile, but probabilities would help them both in establishing what that profile means and in checking whether they’re aligned with it.
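A portfolio view of this kind is easy to produce once forecasts exist. The sketch below, with hypothetical proposals and award sizes, buckets funded projects by their average forecast probability of success and reports the share of funding in each risk band:

```python
# Hypothetical funded portfolio: (proposal id, reviewers' average forecast
# probability of success, award size in $ millions). All values invented.
portfolio = [
    ("P-001", 0.15, 4.0),
    ("P-002", 0.45, 2.5),
    ("P-003", 0.55, 3.0),
    ("P-004", 0.85, 1.5),
    ("P-005", 0.90, 6.0),
]

bands = {"high risk (under 20%)": (0.00, 0.20),
         "moderate risk (20-60%)": (0.20, 0.60),
         "near slam dunk (60% and up)": (0.60, 1.01)}

total = sum(amount for _, _, amount in portfolio)
for label, (lo, hi) in bands.items():
    funded = sum(amount for _, p, amount in portfolio if lo <= p < hi)
    print(f"{label}: {funded / total:.0%} of funding")
```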
Forecasting could also help reduce (or at least reveal) hindsight bias (Roese & Vohs, 2012). When a project fails, people will commonly opine that they always suspected it was never going to work. This can lead to an impression that the investment was somehow foolish, wasteful, or unnecessary. But if reviewers had earlier given that project a 90% chance of success, and it failed, reviewers and organizations can examine their perhaps unwarranted optimism. Forecasting can turn failed projects into learning opportunities. At the other end of the spectrum, success will often invite the response, “well, we always knew that would work,” again suggesting that the research was unnecessary or somehow self-evident. Probabilistic forecasts would help temper that reaction if the project had been assigned, say, an average 30% chance of success during proposal evaluation.
We could also test the effect of peer review discussions, which can potentially create peer pressure to conform to the judgments of other reviewers (Lane et al., 2022; Pier et al., 2019). For example, we could get reviewers to make a probabilistic forecast while blinded to what any other reviewer thinks; then we could gather the reviewers for a discussion (such as an NIH study section); and finally we could ask them to make a second probabilistic forecast. This would allow for tests of which forecasts are more accurate—independent or peer-influenced? Organizations could then adopt the more accurate procedures.
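Once outcomes are known, the comparison itself is straightforward. A sketch, again with invented numbers, of how the two rounds of forecasts could be scored against each other:

```python
# Hypothetical forecasts for the same proposals made before and after panel
# discussion: (pre-discussion probability, post-discussion probability, outcome).
records = [
    (0.60, 0.75, 1),
    (0.40, 0.55, 0),
    (0.30, 0.25, 0),
    (0.70, 0.80, 1),
]

def mean_brier(pairs):
    """Average squared error of probability forecasts against 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

independent = mean_brier([(pre, y) for pre, _, y in records])
discussed = mean_brier([(post, y) for _, post, y in records])
print(f"Independent forecasts:     Brier score {independent:.3f}")
print(f"Post-discussion forecasts: Brier score {discussed:.3f}")
```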
Given these potential benefits, the next obvious question is why the approach hasn’t been widely adopted or at least attempted. A search of the published literature and consultations with members of the forecasting community turned up nothing.
This absence of evidence is puzzling, and it isn’t necessarily evidence of absence. Still, like the proverbial economist who turns up his nose at a $20 bill on the sidewalk because it must be fake (otherwise someone else would have picked it up), we might ask: if such an easily implemented and seemingly useful idea hasn’t been tried, is there a good reason?
One reason might be that some reviewers feel that they can’t be more precise than to put proposals into, say, three buckets of predicted performance: “probably not; 50/50; and most likely.” To make them assign a probability of 37% or 68% might feel like an artificial level of precision. After all, why 37% and not 38% or 36%? Who could possibly explain that kind of fine distinction?
This is a fair point, but if reviewers want to round up (or down) their probability assessments to the nearest 10, or 25, or any other percentage, the beauty of forecasting is that they are free to do so. They need only be somewhat rational (give higher probabilities to proposals they think are more likely to succeed) – and be willing to assign probabilities that can be scored.
This seems worthwhile: if a reviewer claims to be “skeptical” about a proposal, we should try to elicit a more exact probability. Maybe the reviewer has very stringent internal standards, and anything short of a 90% chance of success makes him or her skeptical. Alternatively, maybe the reviewer’s skepticism maps to a 20% chance of success.
Requiring the reviewer to articulate a probability – any probability – makes it easier for others to understand what the reviewer actually means (and potentially helps the reviewer consider what he or she means as well).
Another reason for resistance may be that reviewers are uncomfortable with their performance being judged, which may reflect the organizational or scientific culture in which they operate.
For example, the CIA was infamously uncomfortable with Sherman Kent’s efforts to make analysts provide probabilities with their analytic judgments. Instead of judging the likelihood of an event as “possible, with medium confidence,” Kent wanted analysts to put a probability on their judgments. When someone subsequently accused him of trying to turn the CIA into a betting shop, Kent’s response was, like the man himself, rather legendary: “I’d rather be a bookie than a goddamned poet” (Ciocca et al., 2021).
While that sentiment may be a bit harsh, it is not wrong. If people (reviewers, program officers, agency heads) are determining whether and how to spend millions or billions of federal dollars, they should care about improving their own calibration and judgment.
Indeed, I would argue that NIH and NSF should systematically capture the judgment and calibration of all reviewers and program officers. I could even imagine an ongoing set of awards: “Top-scoring NIH peer reviewer,” “top-scoring SRO,” and the like. This would further incentivize reviewers to make more accurate predictions.
If the National Weather Service's Global Forecast System started proving to be inaccurate (e.g., rain occurred only 70% of the time when the model predicted 80%), meteorologists would be able to jump into action to recalibrate the models, check the data inputs, and draw from a long historical database of past forecasts and outcomes.
But they can only do this because they have a system and a culture of keeping score.
Right now, scientific funders can’t do anything comparable. As Fang and Casadevall wrote:
At its extremes, the error and variability in the review process become almost laughable. One of our colleagues recently witnessed an application to receive a perfect score of 1.0 when part of a program project application, but the identical application was unscored as a stand-alone R01. Almost no scientific investigation has been performed to examine the predictive accuracy of study section peer review. . . . Putting a stronger scientific foundation and accountability into peer review would enhance confidence in the system and facilitate evidence-driven improvements.
As the best way to get ahead is to get started, we could all benefit from being a little more like meteorologists in this respect.
Thanks to Cory Clark and Phil Tetlock for helpful comments.
References
Ciocca, J., Horowitz, M., Kahn, L., & Ruhl, C. (2021, May 9). How the U.S. government can learn to see the future. Lawfare. https://www.lawfareblog.com/how-us-government-can-learn-see-future
Goetz, T. (2011, June 19). Harnessing the power of feedback loops. WIRED. https://www.wired.com/2011/06/ff-feedbackloop/
Lane, J. N., Teplitskiy, M., Gray, G., Ranu, H., Menietti, M., Guinan, E. C., & Lakhani, K. R. (2022). Conservatism gets funded? A field experiment on the role of negative information in novel project evaluation. Management Science, 68(6), 4478-4495.
Lee, C. J., Sugimoto, C. R., Zhang, G., & Cronin, B. (2013). Bias in peer review. Journal of the American Society for Information Science and Technology, 64(1), 2-17.
Pier, E. L., Raclaw, J., Carnes, M., Ford, C. E., & Kaatz, A. (2019). Laughter and the chair: Social pressures influencing scoring during grant peer review meetings. Journal of General Internal Medicine, 34(4), 513-514.
Roese, N. J., & Vohs, K. D. (2012). Hindsight bias. Perspectives on Psychological Science, 7(5), 411-426.
Comments

I think the problem with this line of thinking is that it’s very difficult, even after the fact, to accurately determine whether a project succeeded without a large dose of human judgement.
"We will use method X to improve process Y (that currently uses Z)."
Actual outcome: a paper showing that in certain conditions, X produces the same results as Z, but is 7% cheaper.
Is this a success? Who knows? Even absent fraud, you need human judgement to know (1) whether the conditions are relevant ones in practice and (2) whether 7% is a big deal or not (and maybe X has other types of costs).
Also, this could push the system towards projects whose outcomes are fuzzier to define, even though projects with a high probability of outright failure are also often very clearly worth doing. For example, a clinical trial with a ~5% probability of success is often higher ROI than fuzzy exploratory research (and I’m a “fuzzy exploratory researcher” myself!).
Great piece! I've been thinking about similar approaches to measuring reviewer performance and reputation -- forecasting is one good way! Editors of journals have to do similar soft forecasting when accepting papers (via projected impact factor), so as not to dilute their journal's reputation.
One challenge that makes it even more difficult for proposal reviewers is that there isn't necessarily a clear sense of how to define success (especially for basic exploratory research). For example, if a project doesn't deliver on an initial milestone but does open up the investigator to a new line of inquiry, should that be considered a success?