By Daniel Nüst and Markus Konkol
Research today is complex and often requires diverse skills. Many researchers around the globe collaborate on small or large scale projects to solve global challenges. Often, these challenges can only be solved in {intra, cross, multi, inter, trans}-disciplinary groups and with the help of computing. If you need computers and algorithms to analyse data, no matter how small or big, you are in the realm of computational research. This is in stark contrast to how a large part of scholarly communication, i.e. "creation, publication, dissemination and discovery of academic research", works today. Results are mostly communicated with static documents, and to a large degree even in a format best suited for printing documents on paper — the good ol' PDF. However, scientific output is much more than just PDFs! The scientific article of the past does not suffice anymore to communicate knowledge. That is why it almost feels like the widespread use of computers broke science, and we now must adjust the way we share and communicate research. A research project today does not only create static texts as outputs, but also the related data and the software used to create, process, and visualise that data. And publishing all these building blocks of research in a reusable and sustainable way is very, very hard! The technical solutions and environments exist, but establishing them as common scientific practice takes time and a lot of persuading, because researchers will have to adjust their habits and even give up a few beloved ones.
This blog post will try to introduce some arguments and present some tools to start persuading you to adopt computational research practices. We will try not to fall into technology solutionism, which is very hard as software engineers, but being open to new technologies and a little bit of code will get you very far. Is there a reproducibility crisis? Points of view differ. Your discipline might be better or worse off, better or worse prepared, but we are not going to answer that question here and instead focus on the following one.
So how do we fix scholarly communication?
Well, the short version of it is: we need carrots & sticks, money & time. So no big deal?! In our view, there will be no crazy blockchain-based platform for science that completely replaces the funding and review mechanisms we know today, though you should be aware of the widely discussed topics in scholarly publishing. Editors and reviewers at journals serve an important purpose as gatekeepers and curators. Instead, we should introduce concepts and processes into science that embrace digitisation, especially in a way that allows small and independent journals to embrace them, too. Such change will need time. Until there are stronger requirements and an open infrastructure, it is worth noting that more and more researchers already publish all of the building blocks of science, namely data, software, and the computing environment used. The Open Science movement has achieved tremendous success, particularly in the areas of Open Access and Open Data. We expect Open Methods and Open Code to follow suit within the next years — again, change will take time.
Are Open practices prevalent? No, not yet, despite the large number of benefits (watch out, that's one paper link per word and many carrots for you!). However, software publications (e.g., JOSS and JORS), data publications, software review, open peer review, and not least preprints and prereviews have jump-started academia's catching up with digitisation in recent years. Scientists themselves have the power to push these changes further as individuals, community members, and holders of offices. Many challenges cut across these stellar examples and new ways to enhance science, such as software sustainability and data and software citation. This is where we turn to money, carrots, and sticks: only if funders and evaluators (hiring committees, reviewers) recognise the benefits of transparency and reproducibility will the requirements and rewards be put into place to educate, promote, fund, and eventually demand higher standards of openness.
So what's the golden technology that saves us?
There is one key concept that we think every researcher should be aware of: the research compendium. Publishing a research compendium means that all building blocks of a research paper are shared in a coherent package. Just as with a research manuscript, you take a snapshot of your current work and publish it in reusable form under an open license in a suitable repository, i.e., a repository assigning a permanent identifier. A research compendium can take many forms, and every scientific discipline should have a public discourse about its minimal standards and means to handle specific requirements. Two core technologies underpin many advanced research compendia: notebooks based on the literate programming paradigm and containerisation. Containers allow you to capture the virtual computing environment in a portable and well-defined way. So if you want to up your game in Reproducible Research and Open Science, these are two practical tools you need to get a handle on. Luckily, there are plenty of resources to educate yourself, and likely also one related to your discipline. Watch out for courses and workshops related to openness and reproducibility at the next event you participate in.
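To make the idea more concrete, here is a minimal sketch of one possible compendium layout (the directory and file names below are a common convention we assume for illustration, not a prescribed standard), together with a tiny Python script that checks whether the typical building blocks are present before you deposit a snapshot:

```python
from pathlib import Path

# One common, but by no means prescribed, layout for a research compendium:
# an entry point (README), an open license, the data, the analysis code or
# notebooks, and a description of the computing environment (e.g., a Dockerfile).
EXPECTED = [
    "README.md",   # what is this project and how do I run it?
    "LICENSE",     # open license that enables reuse
    "data/",       # raw and/or derived data
    "analysis/",   # notebooks or scripts implementing the workflow
    "Dockerfile",  # captures the computing environment as a container
]

def check_compendium(root: str = ".") -> None:
    """Print which of the expected building blocks exist under `root`."""
    base = Path(root)
    for entry in EXPECTED:
        present = (base / entry.rstrip("/")).exists()
        print(f"{'OK     ' if present else 'MISSING'}  {entry}")

if __name__ == "__main__":
    check_compendium()
```

Such a check is of course no substitute for discipline-specific standards, but it illustrates that a compendium is, at its core, a well-organised and self-describing package.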
If you embed a container into your research compendium, we call it an executable research compendium (ERC). The research compendium can then undergo peer review, ideally including a reproduction of the contained computational workflow. Research compendia have the potential to enhance peer review and greatly improve scholarly communication. That is precisely what some authors, communities, publishers and journals are starting to do today. If you are an author, reviewer, or editor handling a manuscript that includes data and computations, you should consider leveraging research compendia for better reproducibility. If you want to learn more about research compendia and potential implementations, take a look at the website research-compendium.science. If you are interested in reproductions as part of peer review, take a look at CODECHECK. If you want to learn more about the platforms that are built around containers, notebooks, and research compendia, take a look at a recent preprint on infrastructures for publishing computational research.
We are enthusiastic about openness and reproducibility! We acknowledge there is a spectrum and perfection should not stop you from getting started. Every small step counts! However, some degree of whataboutism exists and we try to debunk some perceived concerns:
... sensitive data?
The solutions exist and range from anonymisation and synthetic data to public infrastructures and access control. This is a question of establishing sustainable processes and long-term infrastructure, so mostly a challenge of funding and adoption.
... big data?
If your workflows use bespoke high-performance computing with huge datasets of several petabytes, we admit a complete reproduction during peer review is unlikely to happen. But you never know! The practice of making sure someone else could reproduce everything will improve your work's quality. In this case, you should include a synthetic dataset or a data subset in your research compendium, so that others can explore and understand your work within more widely available computing resources.
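As a hedged sketch of that idea (the file and column names below are hypothetical), a small shareable subset or a fully synthetic stand-in can be created with a few lines of pandas and NumPy:

```python
from pathlib import Path

import numpy as np
import pandas as pd

Path("compendium/data").mkdir(parents=True, exist_ok=True)

# Option 1: include only a manageable slice of the (potentially huge) dataset.
# "measurements.csv" is a placeholder for your real input file.
subset = pd.read_csv("measurements.csv", nrows=10_000)
subset.to_csv("compendium/data/sample_subset.csv", index=False)

# Option 2: generate fully synthetic data that mimics the structure and value
# ranges of the original without exposing any real records.
rng = np.random.default_rng(seed=42)  # fixed seed keeps the example reproducible
synthetic = pd.DataFrame({
    "station_id": rng.integers(1, 50, size=1_000),
    "temperature": np.round(rng.normal(loc=15.0, scale=8.0, size=1_000), 1),
    "timestamp": pd.date_range("2020-01-01", periods=1_000, freq="D"),
})
synthetic.to_csv("compendium/data/synthetic_sample.csv", index=False)
```

Which of the two options is appropriate depends on your data and on any confidentiality constraints around it.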
... sustainability and reusability?
A research compendium as presented above is a snapshot of your work at a specific point in time, but does not in itself touch on reusability and sustainability. In most guidelines about using research compendia you will actually learn about a good working process, not just about sharing the final product. You can use version control (git) and online collaboration platforms (e.g., GitLab or GitHub) to take advantage of the structure and tooling around packaging your research every day, not just when you finish a specific project.
Future you is your best collaborator! Therefore, all effort you spend on documentation is very well spent. If nothing else, then have a README that would be sufficient for getting yourself started after one year away from a project. Documentation for others can be added on demand. Furthermore, using open file formats (e.g., CSV instead of xlsx) and reasonable file names will make your work more accessible to others and future you. If fellow scientists can build upon your data or reuse your code on their own dataset, new collaborations can happen quickly.
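As a small illustration of switching to open formats (the file names here are again hypothetical), a proprietary spreadsheet can be exported to CSV with a few lines of pandas:

```python
import pandas as pd

# Read every sheet of a proprietary spreadsheet (pandas uses the openpyxl
# engine for .xlsx files, so that package needs to be installed) and write
# each sheet to an open, plain-text CSV file with a descriptive name.
sheets = pd.read_excel("field_survey.xlsx", sheet_name=None)
for sheet_name, table in sheets.items():
    table.to_csv(f"field_survey_{sheet_name}.csv", index=False)
```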
... licensing?
Copyright is important — without it, you would not be able to assign a license to enable others to reuse your work. Copyright is also complicated, because it varies greatly across countries. It must be acknowledged that for some researchers, e.g., in industry collaborations or when using non-open software, publishing all data and software openly is a problem. For the majority of publicly funded research, however, closed science practices remain because knowledge, education, habits, and best practices are missing, and concerns around licenses are not yet debunked.
On a more technical note, more work needs to be done to assign suitable licenses to data, software, and text within a compendium. Different license types are needed for these artefacts, but achieving transparent licensing is not yet simple.
... some researchers not buying in?
Change takes time, but the disruptions in scholarly communication and the problems, such as publication bias or p-hacking, are too pressing for the scientific community to take much longer to address them. When we improve metrics, use alternative ones, or adjust researcher assessment, then "open science" will become "science" and "reproducible research" will become "research" again. Until then, the spirit of preproducibility can unite the researchers interested in better research today. The key challenges are social, not technological!
What does the future of scholarly communication look like?
Thanks for asking!
We hope that research compendia will power the scholarly communication process. Of course we would be proud if our own concept of the executable research compendium is still in use. The ERC features the idea of bindings to let readers more easily interact with workflows and to link the visual presentation more closely to the actual code and data used — and all of that without authors becoming web developers.
With a large body of research compendia published, we expect completely new and exciting ways of discovering, integrating, and interpreting research, even using algorithms for reasoning on research compendia. A "default to open" will save a lot of time and money: easy collaboration and reusability become the norm, because applying the new algorithm of paper A to the dataset of previously published paper B will be possible with a few clicks in an online platform. Federated access to data and computing infrastructure will be transparently negotiated in the background. Contributions to the newly created work and citations of leveraged inputs will be stored in the metadata without manual intervention. Adaptive presentations of research results will exist for different user types, such as decision makers, citizens, or journalists, so that current research is much better connected to the general public. It will be like... the future! We call it an Open Research Infrastructure for Geoinformatics, but any community can build upon and extend these ideas as they need.
We are currently running small pilot studies to demonstrate the concept and prototype in the "real world", but could use more interested authors and editors. One of the pilots plans to enhance the Open Journal Systems (OJS) with research compendia, as we expect universities, independent publishers, and non-profit organisations to increasingly create alternatives to large commercial publishing companies. In light of COVID-19 taking hold of humanity, the limits and challenges around science communication, both within academia and, more importantly, with the public, are broadly revealed. We should take the opportunity to focus not only on reducing (oftentimes perceived) current barriers, but to foster cultural change and innovation. Let's do it — individually and as communities.
We are part of the team behind the project Opening Reproducible Research (o2r). o2r is supported by the DFG. You can follow us on Twitter @o2r_project and check out our software on GitHub.com/o2r-project/. You can also fully dig into all our plans and ideas, because we published our full research proposal. This post gives just a glimpse into the challenges of scholarly communication, academic publishing, and reward systems.