Seminar

Mechanisms for peer review and grant-awarding

Davide Grossi
University of Groningen

Peer review is the hallmark of the scientific system, yet is weighed down with many problems. In the words of Drummond Rennie: “Peer review is touted as a demonstration of the self-critical nature of science. But it is a human system. Everybody involved brings prejudices, misunderstandings, and gaps in knowledge, so no one should be surprised that peer review is often biased and inefficient. It is occasionally corrupt, sometimes a charade, an open temptation to plagiarists. Even with the best of intentions, how and whether peer review identifies high-quality science is unknown. It is, in short, unscientific.”

These issues have become more pertinent as an ever-growing number of journals and funding agencies need to draw ever more referees from a pool of scientists already overwhelmed by the demands of their own publications and funding applications. Indeed, a whole research field has burgeoned around solving the complexities posed by this system.

The good news is that the problem is tractable, and improvements are possible. In his Future of Science seminar, Davide Grossi explored the potential solutions from a mechanism design perspective, giving us a systematic overview of each step in peer review and a series of experiments that might improve each.

A story of peer review

A visual depiction of how we might use Borda scores to assess the best papers to include in a conference.

Telescope Time Without Tears described how the peer review process for access to a highly popular telescope exhausted and overwhelmed the astronomers responsible for the selection. Their process faced all of the limitations of traditional peer review: a lack of scientific rigour, biases that dismissed potentially groundbreaking proposals while letting fraudulent ones through, and an ever-increasing volume of submissions for a limited reviewer pool. All with high stakes tied to their decisions: each proposal they accepted or rejected could shape both the reviewers' own academic careers and those of the applying researchers.

Dr. Grossi divided the solutions to these issues into assignment, evaluation, and aggregation.

Act One: Assignment

This phase is where editors match reviewers with papers. We have two problems here. First, very few reviewers accept the responsibility of peer review; in the telescope example, only the researchers affiliated with the telescope were reviewing. Second, we have the issue of potential mismatching. How do we know who should review which proposal?

In Telescope Time Without Tears, the organisers distributed the reviewing across everyone who applied to use the telescope, requiring each applicant to review three other proposals as part of their application.

Now, onto problem two. We might use similarity scores or semantic text matching to select the most qualified reviewers: the most relevant referees for a submission are those who publish on closely related topics, so we can pair each submission with the reviewers whose own work matches it most closely.
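
One simple way to compute such similarity scores, sketched below purely as an illustration, is to compare keyword sets with the Jaccard index. The reviewers, proposals, and keywords are invented; in practice one might instead compare text embeddings of abstracts or past publications.

    # Hypothetical illustration: score each reviewer-proposal pair by keyword
    # overlap (Jaccard similarity). All names and keyword sets are made up.
    reviewer_topics = {
        "alice": {"exoplanets", "spectroscopy", "atmospheres"},
        "bob": {"galaxies", "dark matter", "lensing"},
    }
    proposal_topics = {
        "proposal_1": {"exoplanets", "atmospheres", "transits"},
        "proposal_2": {"dark matter", "lensing", "clusters"},
    }

    def jaccard(a, b):
        # size of the intersection divided by the size of the union
        return len(a & b) / len(a | b)

    # Higher score = closer match between a reviewer's expertise and a proposal.
    similarity = {
        (reviewer, proposal): jaccard(r_topics, p_topics)
        for reviewer, r_topics in reviewer_topics.items()
        for proposal, p_topics in proposal_topics.items()
    }

    for pair, score in sorted(similarity.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))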

A limitation emerges within this matching mechanism: if we optimise for the closest matches overall, there may be 'losers', submissions that end up with exceptionally poorly matched reviewers. Grossi proposes an alternative, more egalitarian assignment for this case: instead of maximising the total match quality, it maximises the worst match score between a submission and a reviewer. We may end up with fewer 'exceptionally matched' pairs, but the match score of even the least well-matched submission-reviewer pair is as high as possible. A toy comparison of the two objectives is sketched below.
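
The following sketch contrasts the two objectives on an invented similarity matrix, brute-forcing all one-paper-per-reviewer assignments (feasible only at toy scale; a real system would use an optimisation solver). The names and scores are assumptions made for illustration.

    from itertools import permutations

    reviewers = ["r1", "r2", "r3"]
    papers = ["p1", "p2", "p3"]
    sim = {                                    # sim[reviewer][paper] match score
        "r1": {"p1": 1.0, "p2": 0.6, "p3": 0.2},
        "r2": {"p1": 0.6, "p2": 0.3, "p3": 0.2},
        "r3": {"p1": 0.2, "p2": 0.2, "p3": 0.7},
    }

    def assignments():
        # every way of giving each reviewer exactly one distinct paper
        for perm in permutations(papers):
            yield list(zip(reviewers, perm))

    # Utilitarian: maximise the sum of match scores across all pairs.
    utilitarian = max(assignments(), key=lambda a: sum(sim[r][p] for r, p in a))
    # Egalitarian: maximise the match score of the worst-off pair.
    egalitarian = max(assignments(), key=lambda a: min(sim[r][p] for r, p in a))

    print("max total score:", utilitarian)  # higher total, but one pair scores only 0.3
    print("max worst score:", egalitarian)  # slightly lower total, worst pair scores 0.6

In this toy instance the utilitarian assignment squeezes out a higher total by leaving one reviewer with a poor match, while the egalitarian assignment gives up a little total quality to lift the worst-matched pair.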

Act Two: Evaluation

In the evaluation phase, reviewers evaluate the submissions they were matched with. A core problem at this stage is potential manipulation: reviewers might submit evaluations that strategically influence the outcome of their own submission, e.g. by giving low scores to direct competitors. Davide introduced a mechanism that gets around this problem:

We can partition the submissions into distinct groups so that every author only evaluates submissions from a group that does not contain their own.

Suppose we did this for the telescope example. A batch of proposals comes in, and we separate the proposals and their authors into two groups: one for the upcoming three months (group A) and one for the three months after that (group B). Group A authors would review only submissions from group B, and vice versa.

This partitioning method offers a way to distribute the reviewing workload equally and removes the incentive for referees to strategically down-score submissions that compete with their own.
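
A minimal sketch of this cross-reviewing scheme is given below, assuming for simplicity that each proposal has a single author and that the two groups are formed by splitting the submission list in half; the proposal and author names are invented.

    # Hypothetical illustration of partitioned, cross-group reviewing.
    proposals = ["p1", "p2", "p3", "p4", "p5", "p6"]
    authors = {"p1": "alice", "p2": "bob",  "p3": "carol",
               "p4": "dave",  "p5": "erin", "p6": "frank"}

    # Split into two batches, e.g. the next three months vs. the three after.
    half = len(proposals) // 2
    group_a, group_b = proposals[:half], proposals[half:]

    # Cross-review: nobody evaluates a proposal competing with their own.
    assignments = {}
    for p in group_a:
        assignments[authors[p]] = group_b   # group A authors review group B
    for p in group_b:
        assignments[authors[p]] = group_a   # group B authors review group A

    for reviewer, batch in assignments.items():
        print(f"{reviewer} reviews {batch}")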

Act Three: Aggregation

Finally, the reviews received for each submission need to be aggregated into a decision. The core difficulty here is that individual reviewers assign scores inconsistently: a score can be swayed by the mood of the referee, or by whether the proposal was read before lunch or after it.

Davide discussed two main approaches to address this problem:

  1. Miscalibration Models: One approach involves linear miscalibration models to adjust for the variations in reviewer assessments. These models attempt to quantify and correct the discrepancies in scoring by treating the reviewing process as a stochastic (random) process. However, they rest on the assumption that a reviewer's reported score is, up to noise, a linear function of the paper's true quality, which struggles to capture arbitrary subjectivity and other inconsistencies in peer review (a toy illustration of such a model follows this list).
  2. Ranking as an Alternative: The alternative approach Davide suggested is to use rankings instead of scores. Rather than assigning numerical scores to submissions, reviewers rank the proposals they review against each other. Because rankings are ordinal, they capture the relative merit of submissions without being affected by the absolute values different reviewers happen to assign, which mitigates the impact of miscalibration on the overall evaluation. Ranking and aggregating could work like this: the committee chooses M, the number of reviewers each submission should get, and each reviewer ranks the batch of papers assigned to them. Once the rankings are submitted, the committee aggregates them using a 'Borda score'. If M = 3 (i.e. each submission gets three reviews and, in a balanced design, each referee ranks three submissions), a referee's top-ranked submission receives M - 1 = 2 points, the next M - 2 = 1 point, and the last 0 points. To arrive at the aggregated decision, the committee fixes the number of submissions to accept, ranks all submissions in decreasing order of their Borda scores summed across referees, and accepts everything above the cut-off (a minimal sketch of this step also follows this list). More sophisticated versions of this basic idea have already been developed, but these advanced versions haven't been piloted yet.
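
As a toy illustration of the first approach, the sketch below assumes one possible linear miscalibration model, in which reviewer r reports scale_r * quality_p + offset_r + noise, and recovers the latent paper qualities by alternating least squares. The model form, the numbers, and the estimation procedure are all assumptions made for demonstration, not the specific models discussed in the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    n_reviewers, n_papers = 4, 6

    # Ground truth used only to simulate data: latent paper quality plus
    # per-reviewer severity (scale) and leniency (offset).
    true_q = rng.uniform(3, 9, n_papers)
    scale = rng.uniform(0.6, 1.4, n_reviewers)
    offset = rng.uniform(-1.0, 1.0, n_reviewers)
    scores = (scale[:, None] * true_q[None, :] + offset[:, None]
              + rng.normal(0, 0.3, (n_reviewers, n_papers)))

    # Alternating least squares: fix q and fit (scale, offset) per reviewer,
    # then fix (scale, offset) and refit q; repeat.
    q = scores.mean(axis=0)                       # initial guess
    for _ in range(50):
        X = np.column_stack([q, np.ones(n_papers)])
        ab = np.array([np.linalg.lstsq(X, scores[r], rcond=None)[0]
                       for r in range(n_reviewers)])
        a, b = ab[:, 0], ab[:, 1]
        q = ((scores - b[:, None]) * a[:, None]).sum(axis=0) / (a ** 2).sum()

    print("estimated ranking:", np.argsort(-q))   # should track the true ranking
    print("true ranking:     ", np.argsort(-true_q))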
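
And a minimal sketch of the Borda aggregation step, assuming a balanced setting in which each referee ranks the three submissions in their batch; referee names, proposal names, and rankings are invented.

    from collections import defaultdict

    # Each referee submits a best-to-worst ranking of their batch.
    rankings = {
        "ref1": ["propC", "propA", "propB"],
        "ref2": ["propA", "propC", "propB"],
        "ref3": ["propB", "propA", "propC"],
    }
    n_accept = 1                      # the committee fixes the number of acceptances

    borda = defaultdict(int)
    for ranked in rankings.values():
        k = len(ranked)               # with k papers: top gets k-1 points, ..., last gets 0
        for position, paper in enumerate(ranked):
            borda[paper] += (k - 1) - position

    # Rank submissions by summed Borda score; accept everything above the cut-off.
    ordered = sorted(borda, key=borda.get, reverse=True)
    print("Borda scores:", dict(borda))
    print("accepted:", ordered[:n_accept])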

What's next in peer review?

The past two decades have seen an encouraging uptake of experimentation around peer review: open, blinded, pre- and post-publication, and so on. Scientists have started to look at peer review through the lens of science, new journals have emerged dedicated to the topic (e.g. Research Integrity and Peer Review), and meta-science centres and initiatives have been created that take these issues seriously (e.g. METRICS at Stanford). But it is still early days. More theory, controlled experiments, and empirical evidence are needed. A brighter future may lie in replacing the peer-review system we’ve built with one based on scientific principles and robust empirical evidence.