Reproducible research is not hard. Why do so few researchers do it?

In this question we refer to reproduction of computational aspects of reproducing scientific work, not to the repetition (or replication) of for instance lab or field experiments. Four years ago, we wrote an EOS forum piece (together with Roger Bivand), where we explain how R can help reproducible geoscientific research. Why does the large majority of scientific papers published still come without any reproduction materials (data, scripts)?

A lot has been written about reasons why scientists are not very keen on sharing data and computational procedures when publishing their research results. The answers range from fear for discovery of errors, loss of competitional advantage, additional efforts required, and the lack of tangible benefits or rewards.

Scientists increasingly publish software, shown by exponential growth of the number of CRAN packages, and submit more papers about this, as we witness in the number of manuscripts submitted to the Journal of Statistical Software, but the number of regular research papers that come with data and scripts does not take off.

In the DFG-funded project ``Opening Reproducible Research’’ (O2R), we asked the question why few papers are reproducible, and address three sub-questions:

How can we make reproducibility easy?
How can we make reproducible papers interactive?
How can we help researchers to long-term archive their research?

The poster we presented last week at EGU addresses them.

How can we make reproducibility easy?

How do we submit reproduction material along with a manuscript of a scientific paper to a journal? Journals are usually quite unspecific about requirements (create a zip file? publish in a GitHub project? upload in a data repository? on runmycode?), and as a consequence both editors and reviewers are faced with a strong heterogeneity in submitted material, leading to an increase in review efforts or the ignorance of the supplementary material.

Our project O2R will propose a (simple) standard and develop tools to helps authors, editors, reviewers, and later on readers to prepare, review, understand and use the submitted reproduction material. The publication cycle in our EGU poster explains how this works. The duality of (a) an open specification how data and code should be packaged, and (b) the tools to create, share and execute such packages will turn formely supplementary material into core components of a scientific paper.

Structuring requirements for reproduction material makes life easier for everyone, and will increase uptake in the publication cycle.

How can we make reproducible papers interactive?

Have you ever wondered why diagrams and graphs in an arbitray on-line newspaper are interactive, but those in scientific papers are not? If we demand scientific manuscripts to provide their data in a structured form, it is rather trivial to develop tools that present this data in interactive graphs or maps, without requiring from scientists that they understand anything of JavaScript.

Other forms of interaction that become possible with the reproduction material is the modification of (input) parameters, or input data: a simple user interface could be developed automatically that allows a viewer of the paper to modify a parameter (or exchange input data), and see how it affects the resulting graphs. This could lead to earlier and further uptake of the research, new papers from the reader, and more citations for the author.

Creating interactive papers may be, or may become an incentive for researchers to prepare their reproduction material in a structured way, and enable new ways of doing science.

How can we help researchers to long-term archive data?

According to many regulations from research funding agencies (such as the DFG guidelines on good scientific practice), researchers are required to keep scientific data for at least 10 years. But where and how should they do that? What is the data worth without accompanying software to document and analyze it?

University libraries know what is involved in long-term archiving, and have their own standards and workflows for this. Our university library is full (funded) partner in O2R, and wants to develop an offering to researchers for long-term archiving their research material. This involves development of tools, workflows, an archiving system, metadata management, and training of library employees.

Simple tools for the creation of long-term archivable research packages will be an incentive for researchers to actually create and use them.

The future

Right now, publishers are undecided what they can, and should offer to enable reproducibility, and we, scientists, should tell them what to do. Editors are not very decided, and are often from the generation that oversee Today’s technical possibilities. Researchers are, ehm, pretty often inert when something takes them additional time. Librarians are aware of the challenges in archival of digital information, but need collaborations with scientists to go beyond mere byte streams, texts and images to keep the meaning of research alive.

In the O2R project, two publishers, Elsevier and Copernicus are involved as external partners, and will join in a workshop on May 17 where we will decide on requirements for the first year. Both are highly interested in the possibilities and potential of the project. To learn about the workshop results and future project outcomes, follow the project blog, Twitter account and GitHub org.

The O2R project started in 2016, is funded for two years and employs three full-time researchers and three student assistants. A second phase of three additional years of funding may be requested when the first two years make believe that we can develop production-ready infrastructure during the second phase for the open access transformation beyond our university.