Systemic Factors Behind the Replication Crisis in Psychology
Professional incentive systems shaped by a systemic preference for statistical significance play a key role in psychology’s replication crisis. Though scientific progress hinges upon the accumulation and dissemination of new knowledge, those involved in the publication process have mistakenly equated new and important findings with statistically significant results. As a result, journals are more likely to publish significant findings than null results. Moreover, in academia’s highly competitive ‘publish or perish’ culture, career success for researchers is defined by their publication output and impact. Given the well-documented existence of publication bias, it follows that a preference for positive findings within journals will motivate the pursuit of significant results among researchers. As such, it is argued that external pressures to produce significant findings will shape how researchers design, analyse, and report studies such that positive results are more likely to arise. As significant findings become more common, both true and false positives will become increasingly prevalent in published bodies of literature, resulting in low replicability. Looking more broadly, institutional incentives also motivate researchers to overstate research outcomes and seek theory-supportive data. These positivist research practices are enabled by the methodological flexibility associated with psychology and its indirect measures of intangible constructs, which can inflate the false discovery rate in research. When considered in combination, it becomes clear that the incentive systems that unintentionally reward positivist practices have allowed for the continued survival of dud theories, much to the detriment of research integrity and credibility. Thus, the existence of external motivations that bias research output has increased the share of false positives in published literature, acting as one of the central factors behind the replication crisis in psychology.
Though failures to replicate published findings have heightened scepticism around the credibility and integrity of psychological research, critics often fail to account for the widespread prevalence of low replicability in other academic disciplines. The results of recent efforts by the Open Science Collaboration (2015) to replicate 100 results from top-tier psychology journals have created a sense of panic around the validity of published empirical findings. In particular, the study revealed that only 36% of the findings yielded significant results again, and that replication effect sizes were only around half the size of those originally reported (Open Science Collaboration, 2015). From this, many concluded that the low rates of successful replications were indicative of a concerningly high prevalence of false positives and overstated effect sizes in existing literature (Ioannidis, 2005; Zwaan, Etz, Lucas, & Donnellan, 2017). Though such low reproducibility has been used as an indictment of psychology as a scientific discipline, critics of the field have neglected to consider how failures to replicate empirical findings occur in many other areas of academic research, including cancer research (Begley & Ellis, 2012), strategic management (Bergh, Sharp, Aguinis, & Li, 2017), and economics (Camerer et al., 2016). Though low replicability appears mainly in the social and biomedical sciences, some have speculated that disciplines that have not struggled with failures to replicate have simply not yet systematically examined the issue (Zwaan, Etz, Lucas, & Donnellan, 2017). Given the interdisciplinary and widespread nature of low replicability, it stands to reason that the root cause of failures to replicate lies in something common to multiple spheres of academic research: the institutional systems that support and facilitate research.
Institutional demands for a strong publication output and impact motivate the pursuit of novel and significant research findings. As entities whose reputations rely at least in part on the impact of their research output, research institutions have long prioritised and sought out those who exhibit strong scientific productivity (Bazeley, 2003). However, as the influence of scientific work is difficult to measure, journal publications and citations have come to represent one of the few quantitative benchmarks of an academic’s contribution to, and impact on, the field (Fanelli, 2013). As a result, publication output and impact are often prerequisites for professional success among researchers, where they are frequently tied to promotion, pay, and development opportunities (Blackmore & Kandiko, 2011). In this respect, it is clear that researcher behaviours would be strongly motivated by the publication pressures endemic to academia’s highly competitive ‘publish or perish’ culture.
However, the incentive systems associated with such pressures increase researcher susceptibility to the biased processes underlying the selection, publication, and citation of journal articles. Given the importance of publication output and impact for career success, it would be reasonable to expect that researchers would adapt and conform to the publication priorities of various journals (Franco, Malhotra, & Simonovits, 2014). However, in their search for novel findings that form the basis of scientific progress, these journals often equate novelty with statistical significance. Because of this failure to recognise that non-significant results can in themselves be novel findings, systemic biases that favour statistically significant findings over null results exist at both the institutional level and the individual level (Drotar, 2010). These biases are intertwined with publication output and impact: journals prioritise the publication of significant results over null findings, and fellow researchers tend to cite positive results more often (Coursol & Wagner, 1986; Fanelli, 2013). Given the widespread emphasis on statistical significance, those who submit null findings for publication face barriers throughout an editorial peer-review process that is averse to non-significant results (Smith, 2006). This aversion takes the form of comments from editors and reviewers who express that non-significant results are difficult to interpret or are possible false negatives (Ferguson & Heene, 2012). As a result of these difficulties, researchers eventually decline to submit null results and are further motivated to pursue significant findings that are considered publishable (Greenwald, 1975; Smith, 2006; Franco, Malhotra, & Simonovits, 2014). Therefore, it becomes evident that researcher behaviours are influenced by flawed systems for professional success.
Forming the basis of the file drawer problem, journal biases encourage investigators to report significant results over null findings (Rosenthal, 1979; Chalmers, Frank, & Reitman, 1990). Indeed, as publication pressure increases, researcher bias towards reporting positive results also increases, reflecting an increased awareness among researchers that significant findings are more valued by journals (Fanelli, 2010a). This ongoing search for significance at both the journal and researcher level results in an over-representation of significant results in published literature (Fanelli, 2012). However, the strong emphasis on publishing and rewarding significant findings ignores the reality that significant findings consist of both true and false positives (Ioannidis, 2005; Zwaan, Etz, Lucas, & Donnellan, 2017). As true negatives in the form of null findings are avoided by researchers and filtered out by journals, the subsequent publishing of mainly significant results leads to inflated false discovery rates in existing bodies of literature. Given that attempts to replicate false-positive findings are more likely to fail by default (Smaldino & McElreath, 2016), it is reasonable to suggest that low replicability is a by-product of the emphasis on significant findings endemic within the academic system of journals, research institutions, and investigators.
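The arithmetic behind an inflated false discovery rate can be sketched in a few lines. The following is a minimal illustration rather than an analysis drawn from any of the studies cited here: the proportion of tested hypotheses that are true (`prior`), the statistical power, and the significance threshold are all assumed values, and the function name is hypothetical. It computes the share of significant results that are false positives in a literature where only significant results are published.

```python
def false_discovery_rate(prior: float, power: float, alpha: float) -> float:
    """Share of significant results that are false positives.

    prior: assumed proportion of tested hypotheses that are true
    power: assumed probability of detecting an effect that exists
    alpha: significance threshold (false-positive rate when no effect exists)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative values only; none are estimates from the cited literature.
    for prior in (0.5, 0.25, 0.1):
        fdr = false_discovery_rate(prior=prior, power=0.5, alpha=0.05)
        print(f"prior={prior:.2f}  share of false positives={fdr:.2f}")
```

Under these assumed values, if only one in ten tested hypotheses is true, close to half of the significant results are false positives, even though every individual test respects the nominal 5% threshold.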
The systemic biases that favour significant findings also manifest themselves at the individual level, compromising investigator objectivity in research design. From deciding the sampling strategy to the variables to measure and how, investigator objectivity in the research design stage can be compromised by the knowledge that significant findings are associated with greater professional benefits relative to null results (Franco, Malhotra, & Simonovits, 2014). Regardless of the investigator’s underlying intentions, biased research designs encourage false positives, which then increase the likelihood of achieving and publishing a significant result (Smaldino & McElreath, 2016). These effects of investigator bias on research design and, by extension, results can be far-reaching. For example, in clinical research, researcher allegiance remains the best predictor of randomised trial outcomes, where it accounts for 69% of the effect size for psychotherapies (Luborsky et al., 1999; Dragioti, Dimoliatis, & Evangelou, 2014). Given the system-wide preference towards significance, no psychological discipline is completely protected from similar biases, as researchers, by default, aim to seek supportive evidence for their hypotheses (Heino, Fried, & LeBel, 2017). Occurrences of such researcher biases are illustrated in a survey study investigating the prevalence of questionable research practices among psychology researchers. One of the key findings of this widely cited study by John and colleagues (2012) was that engaging in such practices was the prevailing norm within psychology and was even deemed acceptable despite its detrimental impact on replicability. In fact, the majority of respondents reported having engaged in at least one questionable research practice, such as not reporting all dependent measures (John, Loewenstein, & Prelec, 2012).
However, others have noted that the biased wording of the survey may have prompted respondents to over-report instances of questionable practices, which would artificially inflate prevalence estimates (Fiedler & Schwarz, 2016). Nevertheless, the credibility and integrity of research findings can be negatively impacted by biased research designs. As such biases emerge when investigators are psychologically invested in a desired outcome, systemic preferences for significant findings reduce the objectivity of research designs (Leykin & DeRubeis, 2009). Consequently, the false discovery rate is inflated and contributes further to the issue of low replicability in published literature.
However, it should be noted that false positives are a natural part of scientific investigations, as exploratory research, which more often yields false alarms, also provides guidance for later investigations. In particular, scientific progress necessitates the exploration of research domains that have been underserved by academia in order to develop novel findings (Blackmore & Kandiko, 2011). As such, in addition to testing hypotheses, investigators can and should explore their datasets in order to generate additional hypotheses that can be examined in future confirmatory research. Such exploratory research involves the use of multiple analytic alternatives to sift through the data until one yields a significant result (Zwaan, Etz, Lucas, & Donnellan, 2017). As multiple analytic procedures are tested, the likelihood of at least one analysis producing a false-positive finding is, by necessity, greater than in an equivalent confirmatory analysis (Simmons, Nelson, & Simonsohn, 2011). Therefore, while the reporting of significant results in exploratory research informs new directions in the field, readers should be informed that multiple analyses were used so that they are aware of the increased false discovery rate.
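The effect of running multiple analyses follows from simple probability, and can be sketched as below. This is a minimal illustration under stated assumptions: the analyses are treated as independent and each uses the same significance threshold, which is an idealisation, since analyses of the same dataset are usually correlated and the real inflation is then typically somewhat smaller. The function name is hypothetical.

```python
def familywise_error_rate(n_analyses: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n independent
    analyses when no true effect exists (assumes independence)."""
    return 1 - (1 - alpha) ** n_analyses


if __name__ == "__main__":
    # Chance of at least one spurious significant result as analyses accumulate.
    for k in (1, 5, 10, 20):
        print(f"{k:>2} analyses: {familywise_error_rate(k):.3f}")
```

Under these assumptions, ten analyses of data containing no true effect carry roughly a 40% chance of producing at least one significant result, which is why exploratory findings need to be labelled as such.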
In addition to increased false discovery rates, research credibility and replicability are also affected by an inability to detect false positives in existing literature. Though exploratory research can and should guide later investigations, the reporting of such exploratory findings is influenced by a researcher’s desire to construct a compelling narrative. Given professional incentive structures surrounding journals and research institutions, it would be reasonable for researchers to report their findings in a way that convinces readers of the study’s scientific value, such as through an emphasis on significant findings. However, in attempts to construct a compelling and succinct argument, the ambiguity of best research practices gives rise to a problem whereby many researchers present exploratory findings as confirmatory (Ioannidis, Fanelli, Dunne, & Goodman, 2015; Simmons, Nelson, & Simonsohn, 2011; John, Loewenstein, & Prelec, 2012). Therefore, while many researchers believe in the integrity of their own research and view their research practices as acceptable, systemic incentives give rise to biased views of what ‘acceptable’ practices may entail (John, Loewenstein, & Prelec, 2012). In this respect, a lack of awareness of best research practices further contributes to the problem, where the reporting of exploratory research as confirmatory inhibits the ability to detect spurious findings that are more likely to be false positives (Nosek, Ebersole, DeHaven, & Mellor, 2018). Where efforts are made to re-examine effects that are false positives, the null findings yielded by such attempts are less likely to be reported due to the systemic preference for significance (Smaldino & McElreath, 2016). Conversely, any further false positives that arise from such attempts are more likely to be published as significant findings, further building an evidence base for the existence of the original false-positive finding.
Therefore, it is clear that the combination of ambiguity around best research practices and a system-wide premium placed on statistically significant and novel findings gives rise to questionable reporting practices that can reduce research integrity and therefore replicability.
Research credibility is further compromised by researchers who are inadvertently encouraged to overlook the possibility of false positives when defending the scientific value of their research and their professional worth. As a part of the research process, investigators should explore the implications of their study against a broader theoretical context such that they can advance scientific knowledge (Ferguson & Heene, 2012). However, conflicts of interest arise when a persuasive narrative is also required to justify the use of institutional resources when conducting research (Lilienfeld, 2017). Consequently, as the world of academia becomes increasingly competitive, researchers face growing pressures to convince institutions of their scientific impact such that they can acquire and sustain the funds needed to further support their scholarly pursuits (Bazeley, 2003; Lilienfeld, 2017). While it is important for researchers to remark upon a study’s contribution to the field, these institutional pressures inadvertently incentivise the exaggeration of a study’s importance to the field (Lilienfeld, 2012). Given the association between perceived scientific impact and career incentives, researchers are therefore motivated not only to overstate a study’s contribution to the field, but also to overlook the possibility of false-positive findings. Moreover, with pressures to produce greater theoretical contributions under tighter deadlines, researchers have increasingly favoured the collection of theory-supportive evidence over time (Heino, Fried, & LeBel, 2017; Fanelli, 2012; Lilienfeld, 2017). Such actions may not even be intentional, as research on motivated reasoning suggests that investigators will construct biased arguments that favour their preferred outcomes (Kunda, 1990).
Considered more broadly, this implies that those who are invested in certain theories or hypotheses may continue to attempt to find supportive evidence, even when it is no longer conducive to scientific progress (Ferguson & Heene, 2012; Greenwald, Pratkanis, Leippe, & Baumgardner, 1986). Thus, to the detriment of scientific integrity, it is evident that systemic factors have inadvertently incentivised a positivist approach to research, enabling the continued existence of dud theories in the literature.
Scholarly efforts to find supportive evidence for favoured theories are further enabled by the inability to directly measure psychological constructs. Psychology as a scientific discipline is built upon the operationalisation of variables, wherein changes in psychological constructs are mapped onto tangible changes in a real-world measure (Machery, 2007). As new measures of the same constructs can be developed and implemented, there are countless ways to operationalise variables and measure research outcomes, which introduces new avenues through which investigator biases can operate. In particular, while a lack of construct validity is a noted issue in behavioural sciences, researchers are motivated to overlook such issues if the measures yield the preferred results (Flake, Pek, & Hehman, 2017). Though endemic to all social sciences, the suboptimal quality of construct measurements has become a particularly large problem for the field of psychology (De Boeck & Jeon, 2018; Fanelli & Ioannidis, 2013). In particular, due to higher noise and lower construct validity, flexibility around how to measure intangible constructs increases the rate and magnitude of false-positive effects (Fanelli & Ioannidis, 2013; Fiedler, 2018). Given the methodological flexibility found in psychology, it is no surprise that theory-supportive findings are far more prevalent in psychology than in sciences that utilise more direct measures of their variables, such as engineering (Fanelli, 2010b). As a result of the field’s inability to directly measure related constructs, psychology’s positivist approach to research has allowed for the accumulation of spurious supportive evidence for theories, whose credibility is impacted by the use of shoddy measures (Holtz & Monnerjahn, 2017).
When considered in conjunction with how publication bias inhibits the dissemination of null findings which may falsify such theories, it is clear that the issue of low replicability is symptomatic of larger issues around scientific theory development in psychology, in which false-positive findings can form an evidence base for flawed theories.
Overall, by adopting a broader perspective of the replication crisis in psychology, it becomes clear that systemic biases which place a premium on significant findings interact with institutional incentive systems to inflate the rate of false positives in published literature. More specifically, as career progression is based upon publication output and impact, flaws in the publication system will engender flaws in research practices. Due to the existence of publication bias in the editorial and peer-review process, researchers are motivated to seek out significant findings to improve their publication output and subsequently further their career in academia. This search for significant findings biases the entirety of the research process such that false-positive findings are much more likely. In consideration of the inflated false discovery rate and how null findings are filtered out of the publication process, it is evident that the replication crisis is indicative of a larger problem around psychological theory development. With incentives to engage in a positivist research approach, dud theories continue to gather a spurious evidence base of false positives while null findings that may falsify such theories have traditionally been difficult to publish. Therefore, to drive scientific progress in psychology, researchers should look towards addressing the underlying factors that contribute to the replication crisis and to the wider issue of flawed theory development.