Peer review is the hallmark of the scientific system, yet it is weighed down by many problems. In the words of Drummond Rennie: “Peer review is touted as a demonstration of the self-critical nature of science. But it is a human system. Everybody involved brings prejudices, misunderstandings, and gaps in knowledge, so no one should be surprised that peer review is often biased and inefficient. It is occasionally corrupt, sometimes a charade, an open temptation to plagiarists. Even with the best of intentions, how and whether peer review identifies high-quality science is unknown. It is, in short, unscientific.”
These issues have become more pertinent as the ever-increasing number of journals and funding agencies needs to draw more referees from a pool of scientists already overwhelmed by the growing demands of their own publications and funding. Indeed, a whole research field has burgeoned around how to solve the complexities posed by this system.
The good news is that the problem is tractable, and improvements are possible. In his Future of Science seminar, Davide Grossi explored potential solutions from a mechanism design perspective, giving us a systematic overview of each step in peer review and a series of experiments that might improve each one.
A story of peer review
Telescope Time Without Tears described how the peer review process for access to a highly popular telescope exhausted and overwhelmed the astronomers responsible for the selection. Their process faced all the limitations of traditional peer review: a lack of scientific rigour, biases that dismissed potentially groundbreaking research while accepting fraudulent work, and an ever-increasing volume of submissions for a limited reviewer pool. All with high stakes tied to the decisions: each proposal accepted or rejected could shape the academic careers of both the reviewers and the applying researchers.
Dr. Grossi divided the solutions to these issues into three acts: assignment, evaluation, and aggregation.
Act one: Assignment
This phase is where editors match reviewers with papers, and it poses two problems. First, too few researchers accept the responsibility of peer review; in the telescope example, only researchers affiliated with the telescope were reviewing. Second, there is the issue of potential mismatching: how do we know who should review which proposal?
Telescope Time Without Tears tackled the first problem by distributing the reviewing load across everyone who applied to use the telescope, requiring each applicant to review three other proposals as part of their application.
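The article does not spell out how those three reviews were allocated, so here is a minimal sketch of one simple scheme, assuming a circular shift over the list of submissions so that no applicant ever reviews their own proposal; the three-reviews-per-applicant figure comes from the scheme above, everything else is illustrative.

```python
# Minimal sketch: allocate reviews by shifting circularly through the list of
# proposals, so every applicant reviews three others (never their own) and
# every proposal receives exactly three reviews. The shift scheme itself is an
# illustrative assumption, not the telescope's documented procedure.

def assign_reviews(proposal_ids, reviews_per_applicant=3):
    n = len(proposal_ids)
    if n <= reviews_per_applicant:
        raise ValueError("need more applicants than reviews per applicant")
    assignments = {}
    for i, applicant in enumerate(proposal_ids):
        # Applicant i reviews the next `reviews_per_applicant` proposals,
        # wrapping around the end of the list.
        assignments[applicant] = [
            proposal_ids[(i + k) % n] for k in range(1, reviews_per_applicant + 1)
        ]
    return assignments


print(assign_reviews(["P1", "P2", "P3", "P4", "P5"]))
# {'P1': ['P2', 'P3', 'P4'], 'P2': ['P3', 'P4', 'P5'], ...}
```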
Now, on to problem two. We might use a similarity score or semantic text matching to select the most qualified reviewers: the most relevant researchers are those who publish in a similar field, so we can pair each submission with the reviewers whose own work matches it most closely.
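To make this concrete, the sketch below scores reviewer-submission affinity with TF-IDF cosine similarity between a submission's abstract and each reviewer's recent abstracts. TF-IDF is our own illustrative choice; any semantic text-matching model (e.g. an embedding model) could stand in.

```python
# Illustrative similarity scoring: TF-IDF vectors plus cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_matrix(submission_abstracts, reviewer_profiles):
    """Rows are submissions, columns are reviewers; entries lie in [0, 1]."""
    corpus = submission_abstracts + reviewer_profiles
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    n_sub = len(submission_abstracts)
    # Similarity of every submission vector against every reviewer vector.
    return cosine_similarity(tfidf[:n_sub], tfidf[n_sub:])
```

Given such a matrix, a utilitarian assignment would simply pick reviewers so that the total (or average) match score is as high as possible.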
A limitation emerges within this mechanism: if we optimise for the closest matches overall, there can be 'losers', papers or proposals that end up with exceptionally poorly matched reviewers. For this case, Grossi proposes a more egalitarian assignment, one that maximises the worst match score between a submission and a reviewer. While this may yield fewer 'exceptionally matched' papers and reviewers, it guarantees that the match quality of even the least well-matched paper-reviewer pair is as high as possible.
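For small instances, the egalitarian idea can be written down directly: among all one-reviewer-per-paper assignments, keep the one whose worst similarity is highest. The brute-force search below is only meant to convey the objective; a real system would need a scalable algorithm.

```python
# Brute-force sketch of the egalitarian (maximin) assignment: maximise the
# similarity of the *worst* paper-reviewer pair. Exponential in the number of
# papers, so purely illustrative.
from itertools import permutations

def egalitarian_assignment(sim):
    """sim[p][r] = similarity between paper p and reviewer r (square matrix)."""
    n = len(sim)
    best_perm, best_worst = None, float("-inf")
    for perm in permutations(range(n)):  # perm[p] is the reviewer for paper p
        worst = min(sim[p][perm[p]] for p in range(n))
        if worst > best_worst:
            best_perm, best_worst = perm, worst
    return best_perm, best_worst

sim = [[0.9, 0.4, 0.1],
       [0.8, 0.7, 0.3],
       [0.2, 0.6, 0.5]]
print(egalitarian_assignment(sim))  # ((0, 1, 2), 0.5): no pair scores below 0.5
```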
Act two: Evaluation
In the evaluation phase, reviewers evaluate the submissions they were matched with. A core problem at this stage is potential manipulation: reviewers might submit evaluations that strategically influence the outcome of their own submission, e.g. by giving low scores to direct competitors. Davide introduced a mechanism that gets around this problem:
We can partition the submissions into distinct groups so that every author evaluates only submissions from a group that does not contain their own.
Suppose we did this for the telescope example. As proposals came in, we could separate them and their authors into two groups: group A for the upcoming three months, and group B for the three months after that. Group A authors would review only submissions from group B, and vice versa.
This partitioning method offers a way to distribute the reviewing workload equally and eliminates the incentive for referees to behave strategically.
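A minimal sketch of the two-group version might look like the following; the random split and the number of reviews per submission are our own illustrative choices, not details from the seminar.

```python
# Sketch of the two-group partition: authors are split into groups A and B,
# and each author only reviews submissions written by the other group, so no
# one can score a proposal that competes with their own.
import random

def partitioned_assignment(submissions, reviews_per_submission=3, seed=0):
    """submissions: dict mapping submission id -> author id."""
    rng = random.Random(seed)
    ids = list(submissions)
    rng.shuffle(ids)
    group_a, group_b = ids[: len(ids) // 2], ids[len(ids) // 2 :]

    def cross_review(reviewing_group, reviewed_group):
        reviewers = [submissions[s] for s in reviewing_group]
        return {
            sub: rng.sample(reviewers, min(reviews_per_submission, len(reviewers)))
            for sub in reviewed_group
        }

    # Group A authors review group B submissions, and vice versa.
    return {**cross_review(group_a, group_b), **cross_review(group_b, group_a)}
```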
Act three: Aggregation
Finally, we need to aggregate the reviews received from the various groups into a single decision. The core difficulty is that the specific score an individual reviewer assigns to a submission can be biased by things like the referee's mood or whether the proposal was scored before or after lunch.
Davide discussed two main approaches to addressing this problem.
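Those approaches are not reproduced here, but purely as an illustration of what calibrating noisy referees can involve, one common remedy is to standardise each reviewer's scores before averaging them, so that a harsh reviewer's 6/10 and a lenient reviewer's 8/10 carry comparable weight.

```python
# Illustrative calibration, not necessarily one of the approaches from the
# seminar: convert each reviewer's raw scores to z-scores, then average the
# calibrated scores per submission.
from statistics import mean, pstdev

def calibrated_average(reviews):
    """reviews: dict mapping reviewer -> {submission id: raw score}."""
    calibrated = {}
    for scores in reviews.values():
        mu = mean(scores.values())
        sigma = pstdev(scores.values()) or 1.0  # guard against constant scorers
        for sub, raw in scores.items():
            calibrated.setdefault(sub, []).append((raw - mu) / sigma)
    # A submission's final score is the mean of its calibrated reviews.
    return {sub: mean(vals) for sub, vals in calibrated.items()}
```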
What's next in Peer Review?
The past two decades have seen an encouraging uptake of experimentation around peer review: open, blinded, pre- and post-publication, and so on. Scientists have started to look at peer review through the lens of science, new journals have emerged dedicated to the topic (e.g. Research Integrity and Peer Review), and meta-science centres and initiatives have been created that take these issues seriously (e.g. METRICS at Stanford). But it is still early days. More theory, controlled experiments, and empirical evidence are needed. A brighter future may lie in replacing the peer-review system we’ve built with one based on scientific principles and robust empirical evidence.