The replication crisis is about more than just the 'publish or perish' academic system, lack of replications, or p-hacking. It's about how we preprocess our data, how researchers might conceptualize a study, and how hard it is to get sufficient sample sizes to make generalizable claims about specific outcomes.
This reality makes the replication crisis much more complex - we can't just weed out the 'bad apples' or hand everyone a checklist to make results replicable. We have to work with the inherent variability of science.
In her talk, Anna begins to show us how we might work with this variability.
The problem with replication
Anna kicked us off with a review of what makes replication such a complex goal to aim for.
The first step for any researcher, once they have settled on a hypothesis, is to design how they will answer their question. This process is far from a science. It's influenced by the researcher's available resources, experience, collaborators, and many other factors. Huber et al. found that even when many teams are given the same hypothesis, their diverse designs each end up saying something different about it.
We leave the design phase already with a plethora of possible results. Now we have data, and we need to process it. Dreber highlighted the many-analysts problem. Silberzahn et al. explored the natural variation in analysis results by asking 29 teams to answer a single hypothesis with a single dataset. The effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31): 69% of teams found a positive result, 31% did not. They got this spread even though every team worked from the same question and the same dataset.
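To make the many-analysts problem concrete, here is a minimal sketch - my own illustration, not Silberzahn et al.'s analysis - of how two defensible analysis choices applied to the same (simulated) dataset can yield noticeably different effect estimates. All numbers are made up for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
confounder = rng.normal(size=n)
# Treatment assignment is partly driven by the confounder.
treatment = (rng.normal(size=n) + 0.5 * confounder > 0).astype(float)
# True treatment effect is 0.3; the confounder also affects the outcome.
outcome = 0.3 * treatment + 0.6 * confounder + rng.normal(size=n)

# Team A: models the treatment effect without any adjustment.
X_a = sm.add_constant(treatment)
effect_a = sm.OLS(outcome, X_a).fit().params[1]

# Team B: adjusts for the confounder as well.
X_b = sm.add_constant(np.column_stack([treatment, confounder]))
effect_b = sm.OLS(outcome, X_b).fit().params[1]

print(f"Team A estimate: {effect_a:.2f}")  # inflated by the confounder
print(f"Team B estimate: {effect_b:.2f}")  # closer to the true 0.3
```

Neither team has made an arithmetic mistake; they have simply made different, defensible modeling choices - which is exactly the kind of variation the 29 teams exhibited.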
Each of these degrees of freedom yields as many different potential results as there are researchers. The publication machine then curates this variation, favoring the largest effect sizes and the most confident statistical significance.
This reality makes replication even harder. When beginning a replication, you must first decide on a sample size. Because of publication bias, the published literature tends to overestimate the true effect size. Sample-size calculations based on those inflated estimates therefore call for fewer participants than the replication actually needs, leaving the study underpowered and making it challenging to draw meaningful conclusions.
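As a rough illustration of that trap - using statsmodels' power calculator and hypothetical effect sizes, not numbers from the talk - a replication planned around an inflated published effect ends up badly underpowered against the smaller true effect.

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

published_d = 0.6   # hypothetical inflated effect size from the literature
true_d = 0.3        # hypothetical true effect size

# Sample size per group chosen to reach 80% power for the published effect.
n_planned = power.solve_power(effect_size=published_d, power=0.8, alpha=0.05)

# Actual power of that design if the true effect is only 0.3.
actual_power = power.solve_power(effect_size=true_d, nobs1=n_planned, alpha=0.05)

print(f"Planned n per group: {n_planned:.0f}")                 # roughly 45
print(f"Power against the true effect: {actual_power:.2f}")    # roughly 0.29
```

A replication that looked comfortably powered on paper ends up with well under a coin-flip's chance of detecting the effect it set out to test.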
The moral? We can glean very little from any single study - we need more replication, not less. And yet we publish more and more every year. What would it take to replicate the scientific record effectively?
Relieving the burden of replication
Recognizing the inherent variability in research means we need new kinds of solutions. Anna explored several that aggregate across intuitions, designs, and other sources of variability to get closer to the ground truth.
She begins by asking, can we predict what will replicate without replicating it?
Prediction markets are markets where participants trade contracts representing the probability of a specific outcome. Participants buy contracts claiming that a given study will or will not replicate, and each contract pays out if the replication turns out the way it predicted. The market prices of these contracts can therefore be read as the participants' collective belief about how likely a successful replication is. In Nosek et al.'s research on such markets, the predictions picked up something systematic about which research would not hold up: 74% of the time, the market correctly predicted whether a piece of research would replicate.
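As a toy sketch of the mechanics - my illustration, not Nosek et al.'s setup - consider a binary contract that pays 1 if the study replicates and 0 if it doesn't. Trading on your own belief is only profitable while the price disagrees with it, which is what pushes prices toward the crowd's aggregate probability.

```python
def expected_profit(price: float, belief: float, payout: float = 1.0) -> float:
    """Expected profit from buying one contract at `price`, given that you
    believe the study replicates with probability `belief`."""
    return belief * payout - price

# The contract trades at 0.40, but you think replication is 70% likely:
# buying is profitable in expectation, and your trade nudges the price up.
print(expected_profit(price=0.40, belief=0.70))  # 0.30 expected profit per contract
```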
These markets benefit us by aggregating diverse perspectives to estimate how likely a study is to replicate. That aggregated judgment gives future researchers an explicit prior on which to base their own evaluation of the paper. If 80% of the market believes the research will replicate, a researcher can be more confident in trusting the result even without a replication. If, on the other hand, the market shows that researchers disagree about the outcome, replicating the study may be especially valuable for understanding the effect.
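One hedged way to picture "market price as a prior": treat the price as a rough prior probability that the effect is real and update it with a replication result via Bayes' rule. The power and false-positive rates below are assumptions for illustration, not numbers from the talk.

```python
def posterior_effect_is_real(prior: float, replicated: bool,
                             power: float = 0.8, alpha: float = 0.05) -> float:
    """Bayes update for P(effect is real) after one replication attempt,
    assuming the replication succeeds with probability `power` if the effect
    is real and `alpha` if it is not."""
    p_result_if_real = power if replicated else 1 - power
    p_result_if_null = alpha if replicated else 1 - alpha
    numerator = p_result_if_real * prior
    return numerator / (numerator + p_result_if_null * (1 - prior))

market_prior = 0.80  # e.g. the contract trades at 0.80
print(posterior_effect_is_real(market_prior, replicated=True))   # ~0.98
print(posterior_effect_is_real(market_prior, replicated=False))  # ~0.46
```

Under these assumptions, a failed replication drags an 80% prior down to roughly a coin flip - a concrete sense of how much one replication can move a researcher's confidence.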
Another solution is the pre-analysis plan (PAP). PAPs clearly define hypotheses and experimental designs before a study is conducted. This pre-registration doesn't remove the natural variation in design, but it does help avoid publishing false positive results. The trade-off is that such designs limit exploratory analysis and other types of research.
While prediction markets aggregate over the variation and can help us allocate replication resources more wisely, and pre-analysis plans can help us trust individual results more, the variation itself still exists. For Anna, we can embrace this natural variation. Doing so requires cross-team collaboration, coming up with multiple designs, and leaning into differences. When attempting to make a claim, she implored researchers to collaborate with others and explore the data from numerous angles to ensure generalizability.