You are here: Home Departments Neurobiology of Language Yet more evidence for questionable research practices in original studies of Reproducibility Project: Psychology

Neurobiology of Language -

Yet more evidence for questionable research practices in original studies of Reproducibility Project: Psychology

A recent publication in Science claims that only around 40% of psychological findings are replicable, based on 100 replication attempts in the Reproducibility Project Psychology (Open Science Collaboration, 2015). A few months later, a critical commentary in the same journal made all sorts of claims, including that the surprisingly low 40% replication success rate is due to replications having been unfaithful to the original studies’ methods (Gilbert et al., 2016). A little while later, Richard Kunert published an article in Psychonomic Bulletin & Review re-analysing the data by the 100 replication teams (Kunert, 2016). He found evidence for questionable research practices being at the heart of failures to replicate, rather than the unfaithfulness of replications to original methods.
Yet more evidence for questionable research practices in original studies of Reproducibility Project: Psychology

However, his previous re-analysis depended on replication teams having done good work. In this blog post he will show that even when just looking at the original studies in the Reproducibility Project: Psychology one cannot fail to notice that questionable research practices were employed by the original discoverers of the effects which often failed to replicate. The reanalysis he presents here is based on the caliper test introduced by Gerber and colleagues (Gerber & Malhotra, 2008; Gerber et al., 2010):

The idea of the caliper test is simple. The research community has decided that an entirely arbitrary threshold of p = 0.05 distinguishes between effects which might just be due to chance (p > 0.05) and effects which are more likely due to something other than chance (p < 0.05). If researchers want to game the system they slightly rig their methods and analyses to push their p-values just below the arbitrary border between ‘statistical fluke’ and ‘interesting effect’. Alternatively, they just don’t publish anything which came up p > 0.05. Such behaviour should lead to an unlikely amount of p-values just below 0.05 compared to just above 0.05.

The independent replications in blue show many z-values left of the dashed line, i.e. replication attempts which were unsuccessful. Otherwise the blue distribution is relatively smooth. There is certainly nothing fishy going on around the arbitrary p = 0.05 threshold. The blue curve looks very much like what I would expect psychological research to be if questionable research practices did not exist.

However, the story is completely different for the green distribution representing the original effects. Just right of the arbitrary p = 0.05 threshold there is a surprising clustering of z-values. It’s as if the human mind magically leads to effects which are just about significant rather than just about not significant. This bump immediately to the right of the dashed line is a clear sign that original authors used questionable research practices. This behaviour renders psychological research unreplicable.

The conclusion is clear. There is no strong evidence for replication studies failing the caliper test, indicating that questionable research practices were probably not employed. The original studies do not pass the caliper test, indicating that questionable research practices were employed.

As far as I know, this is the first analysis showing that data from the original studies of the Reproducibility Project: Psychology point to questionable research practices [I have since been made aware of others, see this comment below]. Instead of sloppy science on the part of independent replication teams, this analysis rather points to original investigators employing questionable research practices. This alone could explain the surprisingly low replication rates in psychology.

Psychology failing the caliper test is by no means a new insight. Huge text-mining analyses have shown that psychology as a whole tends to fail the caliper test (Kühberger et al., 2013, Head et al., 2015). The analysis I have presented here links this result to replicability. If a research field employs questionable research practices (as indicated by the caliper test) then it can no longer claim to deliver insights which stand the replication test (as indicated by the Reproducibility Project: Psychology).

It is time to get rid of questionable research practices. There are enough ideas for how to do so (e.g., Asendorpf et al., 2013; Ioannidis, Munafò, Fusar-Poli, Nosek, & Lakens, 2014). The Reproducibility Project: Psychology shows why there is no time to waste: it is currently very difficult to distinguish an interesting psychological effect from a statistical fluke.


For full R-code to recreate the analysis and figures, as well as reference list and comments, please visit

Kunert R (2016). Internal conceptual replications do not increase independent replication success. Psychonomic bulletin & review PMID: 27068542

Neurobiology of Language

What is the neurobiological infrastructure for the uniquely human capacity for language? The focus of the Neurobiology of Language Department is on the study of language production, language comprehension, and language acquisition from a cognitive neuroscience perspective. Read more...

Director: Peter Hagoort

Secretary: Carolin Lorenz


Flag NL Het talige brein