I think the problem with this line of thinking is that it's very difficult, even after the fact, to accurately capture whether a project succeeded without a large dose of human judgement.
"We will use method X to improve process Y (that currently uses Z)."
Actual outcome: a paper showing that in certain conditions, X produces the same results as Z, but is 7% cheaper.
Is this a success? Who knows? Even absent fraud, you need human judgement to know whether (1) the conditions are relevant ones in practice and (2) 7% is a big deal or not (and maybe X has other kinds of cost).
Also, one could end up pushing the system towards projects whose outcomes are fuzzier to define, even though projects with a high probability of outright failure are also often very clearly worth doing. For example, a clinical trial with a ~5% probability of success is often higher ROI than fuzzy exploratory research (and I'm a "fuzzy exploratory researcher" myself!)
Another issue is that a reviewer's behaviour-drift timescale might be comparable to the length of any reviewed project (e.g. perhaps both being of the order of 3-5 years). So all your reviewer X data, which is already rather sparsely sampled I'd have thought, only represents their past attitudes, interesting-topic opinions, and biases, and not their current ones.
Still, maybe a %success is better than a rating from 1-5, or whatever. But how would the idea even work for proposals where "give me money to do this interesting thing" (e.g. theoretical physics) has no performance metric beyond "they have [not/partly/all/over] done the interesting thing", where the thing is not really measurable as x% faster/longer/etc., but is just a new understanding? And where "partly" might actually have been a remarkable success, given how hard the thing turned out to be? It seems to me there is going to be a lot of human judgement in determining the reviewer's performance, as well as in rating the project outcome.
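For concreteness, here is one way the %success idea could be scored -- though the post doesn't spell out a mechanism, so this is just a sketch with invented numbers: reviewers attach a probability of success to each proposal, and once a human panel has judged the funded projects' outcomes, each reviewer gets a calibration score (a Brier score in this illustration; the function and data below are hypothetical).

```python
# Minimal sketch (my illustration, not from the post): score a reviewer's
# probability-of-success forecasts against outcomes later judged by humans
# as success (1) or failure (0).

def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and judged outcomes.
    Lower is better; always guessing 50% scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Reviewer X's forecasts for five funded projects, and the human-judged outcomes.
forecasts = [0.8, 0.3, 0.6, 0.9, 0.05]
outcomes  = [1,   0,   0,   1,   1]

print(brier_score(forecasts, outcomes))  # ~0.28 for this toy data
```

Of course, the `outcomes` list is exactly where all the human judgement discussed above gets smuggled in.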
Great piece! I've been thinking about similar approaches to measuring reviewer performance and reputation -- forecasting is one good way! Editors of journals have to do similar soft forecasting when accepting papers (via projected impact factor), so as not to dilute their journal's reputation.
One challenge that makes it even more difficult for proposal reviewers is that there isn't necessarily a clear sense of how to define success (especially for basic exploratory research). For example, if a project doesn't deliver on an initial milestone but does open up the investigator to a new line of inquiry, should that be considered a success?
That would be an improvement; peer review seems to be one of those systems that has followed a winding path to a place no one would choose if starting from scratch, and yet no one alone can break its hold: https://jakeseliger.com/2020/05/24/a-simple-solution-to-peer-review-problems/