Approachable Case Studies Support Learning and Reproducibility in Data Science: An Example from Evolutionary Biology

ABSTRACT Research reproducibility is essential for scientific development. Yet, rates of reproducibility are low. As increasingly more research relies on computers and software, efforts for improving reproducibility rates have focused on making research products digitally available, such as publishing analysis workflows as computer code, and raw and processed data in computer readable form. However, research products that are digitally available are not necessarily friendly for learners and interested parties with little to no experience in the field. This renders research products unapproachable, counteracts their availability, and hinders scientific reproducibility. To improve both short- and long-term adoption of reproducible scientific practices, research products need to be made approachable for learners, the researchers of the future. Using a case study within evolutionary biology, we identify aspects of research workflows that make them unapproachable to the general audience: use of highly specialized language; unclear goals and high cognitive load; and lack of trouble-shooting examples. We propose principles to improve the unapproachable aspects of research workflows and illustrate their application using an online teaching resource. We elaborate on the general application of these principles for documenting research products and teaching materials, to provide present learners and future researchers with tools for successful scientific reproducibility. Supplementary materials for this article are available online.


Introduction
Research reproducibility, the extent to which consistent results are obtained when a scientific experiment or research workflow is repeated (Curating for Reproducibility Consortium 2017), is a key aspect of the advancement of science. It constitutes a minimum standard that allows understanding research products, that is, methods, data, analysis, results, etc. (Piwowar 2013), to determine their reliability and generality, and eventually build up scientific knowledge and applications based on those products (King 1995; Peng 2011; Powers and Hampton 2019). In the natural sciences, rates of reproducibility are low (Ioannidis 2005; Prinz, Schlange, and Asadullah 2011), which has elicited concerns about a crisis in the field (Baker 2016).
In response, the scientific community has been developing new principles and standards to incentivize cultural changes that support a long-term improvement of reproducibility rates in the natural sciences (Peng 2015; Wilkinson et al. 2016; Miyakawa 2020). A standard for reproducibility that has received much attention is availability, which we define as a property denoting that a research product can be reached (acquired, copied, analyzed, processed and/or reused).
In this article, we argue that research products that are digitally available are often unapproachable in practice, because they are not friendly for learners and interested parties with different levels of experience in the field. Research products that are unapproachable counteract availability, and hinder reproducibility in both the short and the long term. To support long-term adoption of reproducible practices in the natural sciences, research workflows need to be made approachable for learners, the researchers of the future (Roland et al. 2002; National Academies of Sciences, Engineering, and Medicine 2018).
To elaborate on our thesis, we designed a case study within the research field of phylogenetics, a discipline within evolutionary biology. We use our case study to identify barriers that have made research workflows largely unapproachable to a general audience in the natural sciences. Then, we propose some principles for researchers to address these barriers and create research workflows that are reproducible by a larger audience. The principles proposed here can be generalized and integrated into the undergraduate and graduate school STEM curriculum, either for courses specialized in reproducibility or within other subject areas, as a necessary component of successful and impactful science.

A Case Study from Phylogenetics
Phylogenetics is a key discipline within evolutionary biology (Dobzhansky 1973). It focuses on investigating the history of shared ancestry of living and extinct organisms using biological data and represents this evolutionary history with a diagram known as a phylogeny or phylogenetic tree (because it grows through time and appears to have branches; Figure 1). Phylogenies provide the basis to study and understand all biological processes in an evolutionary context (Dobzhansky 1973). Hence, it follows that improving reproducibility rates in phylogenetics has the potential to positively impact research across the natural sciences.
To explore barriers to approachable phylogenetics, we develop a case study that touches on three common problems within the field: standardizing organism names in phylogenies, obtaining current phylogenetic knowledge for a group of organisms, and summarizing this phylogenetic knowledge in a meaningful way. To address these problems, we propose a research workflow that relies on resources from the Open Tree of Life (OpenTree), an open source project that provides digital availability of phylogenetic results from published, peer-reviewed research, which is considered to be vetted and state-of-the-art knowledge in the field. OpenTree phylogenies are stored in a public database, the Phylesystem (McTavish et al. 2015), and are downloadable as various computer-readable file types, which is key for reusable and reproducible workflows (Wilson et al. 2017). OpenTree also provides access to a single standard for organism names (taxonomic standard) that is applied to the stored phylogenies (Rees and Cranston 2017). All of these resources are available for download and use from OpenTree, free of financial cost to any user. One way to access OpenTree resources is manually, through its Graphical User Interface (GUI; that is, a website or application that allows users to access and use functionalities with mouse or keyboard clicks). However, eliminating as many manual steps as possible in research workflows is key for reproducibility, as manual data manipulation scales poorly and is prone to error (Bakken 2019). OpenTree's resources are also programmatically available through its Application Programming Interface services (APIs; that is, computer code that automatically implements functionalities, usually used by programmers to build different or tailored functionalities). While APIs provide data processing scalability and reproducibility (Open Tree Of Life et al. 2016), they come at a high technical and cognitive cost for the user, who requires considerably more computer programming experience and literacy to use APIs successfully. OpenTree's API services have been wrapped by the rotl R package (Michonneau, Brown, and Winter 2016) and the opentree Python module (McTavish, Sánchez Reyes, and Holder 2021). The R and Python programming languages are open source and free of cost, and are two of the most widely used programming languages in the sciences today (Eglen 2009; Baker 2017). As such, the rotl and opentree software packages contribute to the approachability of OpenTree's resources for R and Python users, increasing availability to a wider user base.
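To give a sense of the detail that wrapper packages such as rotl and opentree hide from the user, the following is a minimal Python sketch of direct programmatic access using only the standard library. The base URL, the endpoint paths (`tnrs/match_names`, `tree_of_life/induced_subtree`), and the request fields (`names`, `ott_ids`) are assumptions recalled from OpenTree's public v3 web services documentation and should be verified there before use; all function names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of OpenTree's v3 web services; verify before use.
API_BASE = "https://api.opentreeoflife.org/v3"


def build_request(endpoint, payload):
    """Prepare (but do not send) a JSON POST request to an API endpoint."""
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_api(endpoint, payload, timeout=30):
    """Send the request and decode the JSON reply (needs network access)."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def match_names_request(names):
    """Request that standardizes organism names (assumed field: 'names')."""
    return build_request("tnrs/match_names", {"names": list(names)})


def induced_subtree_request(ott_ids):
    """Request a summary tree for taxon ids (assumed field: 'ott_ids')."""
    return build_request("tree_of_life/induced_subtree", {"ott_ids": list(ott_ids)})
```

Nothing is sent over the network until `call_api` is invoked, for example `call_api("tnrs/match_names", {"names": ["Canis lupus"]})`. Even in this small sketch, the user has to reason about URLs, JSON encoding, and HTTP methods before reaching any biological question.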
However, while learners in the natural sciences have been engaging independently with R and Python programming languages, computer programming is not traditionally a core skill formally taught to biologists and naturalists (Sayres et al. 2018; Wright et al. 2019; Williams et al. 2019). As computers continue to play a larger role in most scientific disciplines (Piccolo and Frampton 2016), higher baseline computational skills are required across all natural sciences not only to develop an original research workflow, but to be able to follow and reproduce research workflows from other researchers (National Academies of Sciences, Engineering, and Medicine 2019).
Thus, efforts to increase long-term reproducibility rates in the natural sciences would benefit from addressing specific barriers for learners in the field, to support them in acquiring the skills needed to reproduce research workflows that rely heavily on computer code (Peng 2011; Sandve et al. 2013; Powers and Hampton 2019).
In the next section, we describe (in no particular order) three barriers to approachable research workflows that we identify using our case study. Then we develop a set of principles to address these barriers and apply the latter to a set of teaching materials that are available at https://mctavishlab.github.io/R_OpenTree_tutorials/.

Identifying Barriers to Approachable Research Workflows
The main goal of our case study is to obtain a single phylogeny summarizing data from a set of published phylogenies for the canids (the family of dogs, coyotes, wolves, etc.), our organisms of study. All analyses for our case study can be accomplished entirely with functions from the R package rotl or the Python module opentree. If a researcher were to use the proposed analysis workflow in a publication, they would typically describe it in the methodology section as "The canid summary phylogeny was obtained using functions from X package, details are available as supplementary materials." This is usual practice, mainly because journals do not have space to publish all code used for an analysis in the methods section. Yet, supplementary materials and data are often not peer-reviewed as thoroughly as the main manuscript, or at all (Pop and Salzberg 2015). They are also prone to the dreaded promise "available upon request," which has very low rates of fulfillment (Krawczyk and Reuben 2012; Gabelica, Bojčić, and Puljak 2022). Without the primary data and code that were used to perform an analysis, it is impossible to reproduce said analysis (Miyakawa 2020).
Some of these questions can be answered by referring to the software documentation, which is usually publicly available and can be accessed by any potential user. As opposed to code, software documentation is written in natural language (i.e., any known human language, e.g., English, Spanish, Chinese) and is considered a key element for successful adoption of software by target users (Karimzadeh and Hoffman 2018). This might explain why documentation for software addressed to academic users is also usually written using highly specialized computational language or jargon (i.e., computationally specific concepts, words, and phrases) as well as formal scientific and academic language. We identify this as barrier 1 to approachable research workflows: specialized language is intimidating.
While scientific jargon might have an important role for formal acceptance of software by the scientific and academic community, it can be perceived as cold and/or intimidating language that often slows down or even obstructs examination, application, and adoption of code by a wider audience (Ball 2017). In contrast, introducing information without the use of jargon supports learners' conceptual understanding of new ideas and concepts (McDonnell, Barker, and Wieman 2016; Pan et al. 2019).
Another element of good software documentation is thoroughness: it should describe the general usage of individual functions, as well as the arguments and variables that each function can take (Karimzadeh and Hoffman 2018). Individual documentation for each function is usually presented in alphabetical order and is not tied to a specific analysis goal. Moreover, most software has numerous functions, so documentation is usually lengthy and hard to navigate. This can increase the amount of information that users need to process simultaneously, which can overload the finite working memory that any person possesses, a burden known as cognitive load (Sweller 1988). In this context, identifying connections across functions that are meant to work together in the same analysis workflow can become a very difficult task. We recognize this as barrier 2: lack of specific goals leads to high cognitive load. High cognitive load is known to have a negative effect on learning software (Chandler and Sweller 1996; Van Merriënboer and Ayres 2005; Lambert, Kalyuga, and Capan 2009).
A third important aspect of software documentation is the inclusion of examples that demonstrate usage of individual functions (Karimzadeh and Hoffman 2018). Examples presented in software documentation are usually worked to perfection, as they are intended to showcase the ideal or minimal case in which a function works well. Perfectly worked examples ignore the user experience by maintaining focus on the software content and fail to provide users with expert and clear advice on how to troubleshoot if needed. We identify this as barrier 3: lack of troubleshooting examples. Error management training is an approach that focuses on framing mistakes as beneficial to learning complex tasks, to give learners the opportunity to actively explore a task with a positive mindset (Frese 1995). Providing examples that showcase potential errors supports users' performance (Steele-Johnson and Kalinoski 2014), and can greatly improve learners' ability to troubleshoot outside the classroom (Shannon and Summet 2015; Nederbragt et al. 2020).
In sum, best practices for good software documentation are not enough to promote reproducibility of published research workflows that rely heavily on code. In the following section, we describe some principles that can help to reduce or remove the identified barriers, to create research workflows that are more approachable and hence more reproducible by a larger audience.

Principle 1. Use Friendly, Relatable and Respectful Language
Avoiding formal language and incorporating elements of pop culture, such as picture character icons known as "emojis," make the language more familiar to a broader target audience (Figure 2). We made an effort to specifically complement the primary documentation by identifying computational concepts that were assumed or were not explained in depth. We vetted the tutorials through feedback from workshop participants as well as individual users to identify such specialized concepts.

Principle 2. Reduce Cognitive Load by Providing Specific and Clear Goals with Literate Programming
Cognitive load can be greatly reduced for learners by applying an active learning strategy such as linking usage to a "real world" or "human" application (Felder and Brent 2009). Computer programming languages are by themselves quite abstract and represent a learning subject with a potentially high cognitive load for most learners. Pedagogical research shows that active learning practices are one of the most effective ways to take on abstract subjects (Freeman et al. 2014). A story-like narrative that links code usage in an integrative example invites learners to try the code, which can lead them to remember what they are doing and why they are doing it. This "literate programming" paradigm (Knuth 1984; Fritzson, Gunnarsson, and Jirstrand 2002) makes code more approachable, as it integrates narratives with computer code in the same document, helping learners to actively follow the code usage and supporting memory and understanding (Piccolo and Frampton 2016). We propose that documents developed with "literate programming" can be made more accessible by choosing narratives that are relatable to a more general audience. An easy way to do this in biology is choosing a charismatic taxon as a model organism. For a research group, this can be the biological group they are studying. For the general audience, a highly charismatic group, such as dinosaurs, should work well. For example, when we presented our tutorial to the Amphibia Web Organization (van der Meijden et al. 2002) in January 2020, we tailored all examples to frogs and their allies.
We examined available software documentation for the package rotl and designed a narrative that requires the usage of as many functions as possible. We demonstrate code applications that are commonly requested by OpenTree users, but that are not demonstrated in the documentation of the package. By framing the function workflow using highly requested uses, the documentation acquires a narrative arc that is easier for users to follow and remember. This can also facilitate translating the code application to other use cases of interest for learners in biology.
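As a minimal sketch in the spirit of literate programming, and assuming a plain Python script rather than the R tutorials themselves, the snippet below keeps a goal-driven narrative next to the code it explains and uses the standard-library `doctest` module to re-run the examples embedded in that narrative. The function `tip_labels` and the toy tree are hypothetical illustrations and are not part of rotl or opentree.

```python
import doctest
import re


def tip_labels(newick):
    """Goal: find out which organisms appear in a phylogeny.

    Phylogenies are often shared as Newick text, where parentheses group
    relatives and names sit at the tips. Here is a toy tree (made up for
    illustration) with three canids:

    >>> toy_tree = "((Canis_lupus,Canis_latrans),Vulpes_vulpes);"

    Asking for its tips returns the names in the order they are written:

    >>> tip_labels(toy_tree)
    ['Canis_lupus', 'Canis_latrans', 'Vulpes_vulpes']
    """
    # A tip label follows "(" or "," and runs until the next "(", ")", ",", ":" or ";".
    return re.findall(r"(?<=[(,])[^(),:;]+", newick)


# Extract the examples from the narrative and execute them,
# so the story and the code cannot silently drift apart.
narrative = doctest.DocTestParser().get_doctest(
    tip_labels.__doc__, {"tip_labels": tip_labels}, "tip_labels", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(narrative)
```

The narrative states the goal first and introduces each step with a reason, so a reader can follow the story without knowing the full function reference. The simple pattern used here only handles unquoted labels without spaces.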

Principle 3. Provide Examples That Are User-focused by Demonstrating Errors and Warnings
An activity that has become increasingly widespread in programming-language education is live programming. During live programming, an instructor writes code and executes it in a way that is visible to learners through a screen (Guzdial and Barr 2013; Selvaraj et al. 2021). One benefit of this practice is that typos and mistakes occur, normalizing them for learners. Watching an instructor handle errors shows learners how to solve them when they are outside the classroom (Shannon and Summet 2015; Nederbragt et al. 2020). When coauthor McTavish was a postdoc, teaching an introductory programming workshop as a volunteer with the Carpentries (a nonprofit group that teaches foundational coding and data science skills to researchers worldwide; Wilson 2006, 2022), a senior faculty member taking the workshop complained that the typos were slowing things down and interfering with the pedagogy. McTavish replied "the typos ARE the pedagogy." This has become a slogan of sorts at the Carpentries, capturing the idea that embracing and discussing mistakes is essential to teaching programming (Wilson 2019). Yet, working through mistakes is rarely done in written pedagogical materials. Software documentation focuses on demonstrating function usage with examples that work seamlessly, without errors. We argue that the opposite is needed to support adoption of reproducible workflows and long-term independence in learners' and users' performance (Gaspar and Langevin 2007; Steele-Johnson and Kalinoski 2014). In our tutorials, we apply this principle by demonstrating examples that do not work as expected, and exemplifying ways to address issues (Figure 2). For example, we identified inputs that would give a wide range of warnings and errors. We then focus on providing explanations for these error and warning messages. We believe this helps users and learners to be less intimidated by the messages, and to practice extracting useful information from them.
One of the most essential skills in programming is interpreting and moving forward from errors. We therefore also demonstrate ways to evaluate inputs to determine if they will trigger an error or warning, and design and demonstrate alternative analysis routes to follow when faced with an error or warning. This has two pedagogical benefits. First, it provides users and learners with the means to troubleshoot their own warnings and errors. Second, it allows them to understand in more depth what the code is doing.
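The following Python sketch illustrates this pattern with a self-contained stand-in for a name-matching step. The reference list, its identifiers, and the functions `check_names` and `match_names` are hypothetical and do not reproduce the behavior of rotl or opentree; the aim is only to show inputs being evaluated first, a warning being read rather than ignored, and an alternative route being taken after an error.

```python
import warnings

# Hypothetical reference list; the ids are placeholders, not real taxon ids.
REFERENCE = {"canis lupus": 1, "canis latrans": 2, "vulpes vulpes": 3}


def check_names(names):
    """Split input names into those that will match and those that will not."""
    cleaned = [name.strip().lower() for name in names]
    found = [name for name in cleaned if name in REFERENCE]
    missing = [name for name in cleaned if name not in REFERENCE]
    return found, missing


def match_names(names):
    """Return ids for matched names; warn on partial matches, fail on none."""
    found, missing = check_names(names)
    if not found:
        raise ValueError(f"No names could be matched: {missing}")
    if missing:
        warnings.warn(f"Names without a match were dropped: {missing}")
    return {name: REFERENCE[name] for name in found}


# A typo in one name triggers a warning; we capture it and read what it says.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    partial = match_names(["Canis lupus", "Vulpes vulpez"])
warning_text = str(caught[0].message)

# An input with no valid names triggers an error; we recover by fixing the typo.
try:
    recovered = match_names(["Vulpes vulpez"])
except ValueError as error:
    error_text = str(error)
    recovered = match_names(["Vulpes vulpes"])
```

Calling `check_names` before `match_names` lets a learner predict which inputs will cause trouble, and both messages name the offending input, which is the piece of information a learner needs to move forward.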

Conclusion
Response from the community has been invaluable in gauging success of our teaching materials. Senior researchers often comment on the usefulness of the tutorials for their research, as well as how they have supported students in using the demonstrated R packages more independently. We note that making approachable research workflows has all the advantages of reproducible research workflows. It saves time during explanation and training, when analyses are run by new collaborators and students. It can also save research time for yourself, when analyses are run again with more data, a different dataset, a different organism or biological model. It contributes to scientific efforts that can build on each other.
When developing our tutorials, we not only applied the principles elaborated here to make them more approachable, but we followed basic recommendations for successful reproducibility (Sandve et al. 2013). Applying all these principles in a tutorial not only teaches reproducibility, but also makes the teaching process itself reproducible (Dogucu and Cetinkaya-Rundel 2022). For example, we published the tutorials on a persistent and public website (Sánchez Reyes, McTavish, and Holder 2021) that adheres to the four R's of openness, by being freely licensed, free of cost, and thus free to use, reuse, redistribute, revise, and remix (Hilton III et al. 2010). To make the website persistent, any updates to the tutorial are published as new versions. Versions presented at workshops are copies of the original repository, and constitute a temporally stable snapshot of functions and workflows presented during a live workshop (Wilson 2006, 2022). This ensures that the tutorials are available for users to return to any time they need them, and to be shared with other users and learners (Figure 3). Finally, we dedicate a section of the tutorial to document the software versions that are demonstrated throughout. A common issue in open source software packages written in R and Python is the deprecation of functions (i.e., functions that are no longer recommended or maintained, and that are in the process of being phased out and replaced by new ones; Marks et al. 2017; Vadlamani, Kalicheti, and Chimalakonda 2021). Running code that has fallen victim to deprecation requires considerable programming savvy, and it is a task that is hard even for experienced programmers (Vadlamani, Kalicheti, and Chimalakonda 2021). Yet, old versions of open source software packages remain digitally accessible, and persist in time. If a package version is known, an analysis can be reproduced without investing additional resources in finding the appropriate functions to run it, even long after a function has been deprecated.
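As a minimal sketch of how software versions can be recorded alongside an analysis, assuming a Python workflow, the snippet below collects the interpreter version and the installed version of each named package using the standard library. The function name `record_versions` is hypothetical, and the package names passed to it are up to the user.

```python
import sys
from importlib import metadata


def record_versions(packages):
    """Map 'python' and each package name to a version string."""
    # The first token of sys.version is the interpreter version number.
    versions = {"python": sys.version.split()[0]}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of stopping the whole record.
            versions[package] = "not installed"
    return versions
```

For example, `record_versions(["opentree"])` could be printed at the end of a tutorial or saved next to the results. In R, `sessionInfo()` plays a similar role.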
Incorporating the principles described here into teaching resources not only improves reproducibility practices, but it should facilitate adoption of software and analysis workflows in the natural sciences, among researchers at different academic levels, from undergrads to established researchers. It can also help close the academic gap that is generated by uneven access to computational resources across students belonging to different groups (KewalRamani et al. 2018), where the most affected learners usually belong to underrepresented minorities and rural areas (Warner et al. 2021). Differences in access can also be due to gender-biased parental and community pressures, in which male-identifying individuals are more likely to be encouraged to perform activities related to computers (Google Inc. and Gallup Inc. 2016), while female-identifying individuals are discouraged, starting from as early as elementary school (Master, Meltzoff, and Cheryan 2021).
Some universities and academic groups have started to incorporate reproducibility as a subject into their curriculum (NIGMS Career Curriculum Development 2015; Debruine and Taylor 2019; University of Washington Libraries 2022). The focus of these resources has been for students to acquire and practice skills to document their work. The principles identified and outlined here can be used to set learning goals and outcomes for new reproducibility syllabi. Ultimately, the long-term improvement of reproducibility rates in science will depend on our ability to intentionally integrate best practices for achieving reproducibility into the educational framework of future data scientists (National Academies of Sciences, Engineering, and Medicine 2018). Inclusion of reproducibility into the data acumen of the undergraduate curriculum will provide college learners and future researchers with the tools to develop the fundamental skills needed to successfully create reproducible scientific workflows and research products.