A New Era of Learning: Considerations for ChatGPT as a Tool to Enhance Statistics and Data Science Education

Abstract
ChatGPT is one of many generative artificial intelligence (AI) tools that have emerged recently, creating controversy in the education community with concerns about its potential to be used for plagiarism and to undermine students’ ability to think independently. Recent publications have criticized the use of ChatGPT and other generative AI tools in the classroom, with little focus on the potential benefits. This article focuses on the potential of ChatGPT as an educational tool for statistics and data science. It encourages readers to consider the history of trepidation surrounding introducing new technology in the classroom, such as the calculator. We explore the possibility of leveraging ChatGPT’s capabilities in statistics and data science education, providing examples of how ChatGPT can aid in developing course materials and suggestions for how educators can prompt students to interact with ChatGPT responsibly. As educators, we can guide the use of generative AI tools in statistics and data science classrooms so that students and educators can leverage the benefits of this technology.


Introduction
The introduction of the calculator in the mid-20th century had a significant impact on mathematics education (Ellington 2003; Watters 2015). Initial reactions to this technology were mixed, with some educators and mathematicians expressing concern about the potential negative effects of calculators on students' ability to perform basic arithmetic operations and understand mathematical concepts (Savage 1986; Ellington 2003; Watters 2015). Others, however, saw the calculator as a valuable tool that could enhance students' understanding of mathematics and enable them to solve more complex problems (Savage 1986; Watters 2015). Despite the controversy surrounding the use of calculators in mathematics education, the technology has become ubiquitous in classrooms around the world and is now considered an essential tool for students at all levels of mathematical learning (Ellington 2003).
In 2015, Watters referred to calculators as "arguably one of the most controversial pieces of education technology to enter the classroom" (Watters 2015). The end of 2022 brought a new contender for the most controversial technology in the classroom, ChatGPT. ChatGPT is one of many generative artificial intelligence (AI) tools, working like a chatbot to generate text responses to user-provided prompts. Like the calculator, the emergence of ChatGPT has quickly created controversy in the educational landscape. Recent articles boast titles such as "ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?" (Rudolph, Tan, and Tan 2023), "ChatGPT: The end of online exam integrity?" (Susnjak 2022), "Chatting and cheating: Ensuring academic integrity in the era of ChatGPT" (Cotton, Cotton, and Shipway 2023), and "ChatGPT user experience: Implications for education" (Zhai 2022). The message is clear: educators need to take note of ChatGPT, but what note should they take?
The initial response to ChatGPT from the education community has been largely cautionary; some educators and universities are taking steps to mitigate its use in academic settings. One of the primary concerns has been the potential for students to use ChatGPT to plagiarize or generate work that is not their own, leading to concerns about academic integrity (Rudolph, Tan, and Tan 2023; Susnjak 2022; Zhai 2022; Cotton, Cotton, and Shipway 2023). To address these concerns, educators are implementing various tactics, including using tools to detect the use of ChatGPT or implementing alternative assessment plans (Susnjak 2022; Cotton, Cotton, and Shipway 2023; Rudolph, Tan, and Tan 2023). Another concern about ChatGPT in the education landscape is that its use will lead to students being unable to think independently or detect if and when AI-generated text is incorrect (Baron 2023). Because ChatGPT is an artificial intelligence tool, broader ethical concerns also apply to the material it generates, such as the possibility of propagating biases present in source material into its responses (Dwivedi et al. 2023). Many others have discussed the potential negative implications of ChatGPT in education (Baron 2023; Cotton, Cotton, and Shipway 2023; Dwivedi et al. 2023; Gilbard 2023; Hirsh-Pasek and Blinkoff 2023; Rudolph, Tan, and Tan 2023; Weissman 2023); as such, that is not the focus of this article.
Rather than focusing on the cautionary tales of generative AI tools, this article focuses on the potential of ChatGPT as an educational tool for statistics and data science. We encourage readers to consider the history of trepidation amongst educators that has surrounded the emergence of new technology, such as the calculator, WolframAlpha, and Wikipedia, all of which caused concern when first introduced but are now commonly used as learning tools (Young 2009; Watters 2015; D'Agostino 2022). While generative AI tools are different in many ways from these technologies, they share the commonality that they have been a controversial addition to the set of tools available to both students and educators. In this article, we first describe ChatGPT and how to use it, then provide suggestions and examples for leveraging ChatGPT in statistics and data science courses. We then challenge the reader to think about the potential benefits of ChatGPT and how educators can incorporate ChatGPT in the classroom and train our students to use it responsibly.

What Is Generative AI and ChatGPT?
Generative AI is a type of artificial intelligence that is designed to create new content, such as images, music, and text. These systems use deep learning techniques to analyze patterns and relationships in large datasets, allowing them to produce novel and creative output. One type of generative AI that has gained significant attention in recent years is large language models. These models are trained on vast amounts of text data and are specifically designed to process and understand human language (Wolfram 2023).
ChatGPT ("GPT" stands for "Generative Pre-trained Transformer") is a large language model generative AI tool that was introduced by OpenAI in November of 2022. It works like a chatbot, generating text responses to user-provided prompts. Users can request simple text responses or more complex ones, such as programming code, sonnets, entire essays, and mathematical theorems. The generative AI landscape is changing rapidly, with new tools regularly becoming available.
The focus of this article is on ChatGPT due to its widespread popularity, but the general principles in this paper apply to other generative AI tools as well.

Capabilities of ChatGPT
ChatGPT has a wide range of capabilities in generating text output, and educators should be aware that ChatGPT can generate output that sounds remarkably human-like. The conversational nature of ChatGPT is modeled in "A Conversation on Artificial Intelligence, Chatbots, and Plagiarism in Higher Education" (King and ChatGPT 2023). Others have also investigated the mathematical capabilities of ChatGPT, including its ability to solve complex mathematical problems and write proofs (Frieder et al. 2023).
To demonstrate the capabilities of ChatGPT for answering questions about introductory statistics content, we provided three prompts asking for the definition of a confidence interval and three prompts asking for the definition of a p-value. Both are concepts that learners of statistics find challenging. The responses provided by ChatGPT to these prompts about confidence intervals and p-values are provided in Tables 1 and 2, respectively. The phrasing of each prompt is slightly different to illustrate the variety of responses that ChatGPT can generate to similar prompts. Repeatedly inputting the same prompt can also lead to different generated text; the responses provided in Tables 1 and 2 were the first responses generated by ChatGPT at the time of inquiry.
Table 1. Example prompts provided to ChatGPT and resulting responses generated regarding the definition of a confidence interval.

Prompt: Give the definition of a confidence interval
ChatGPT response: A confidence interval is a range of values calculated from a sample of data that is used to estimate an unknown population parameter. The interval is derived from the sample mean and a measure of variability, such as the standard deviation or standard error, and is expressed as a range of values that has a specified level of confidence associated with it. For example, if a 95% confidence interval is calculated for a population mean, it means that if we were to repeatedly sample the population and calculate a 95% confidence interval for each sample, we would expect 95% of those intervals to contain the true population mean.
The width of the confidence interval is determined by the sample size, the variability of the data, and the desired level of confidence. In general, as the sample size increases or the level of confidence increases, the width of the interval will decrease.

Prompt: Give the definition of a confidence interval using non-technical language
ChatGPT response: A confidence interval is an estimate of the range of values that are likely to contain a certain statistic about a population, based on a sample of that population. It's like making an educated guess about what the true value of that statistic might be for the entire population, based on the data that we have.
Think of it this way: imagine you want to know the average height of people in a certain city. You can't measure the height of everyone in the city, so instead you measure the height of a smaller group of people, which is your sample. From that sample, you calculate a range of values that is likely to contain the average height of the entire city, and that range is your confidence interval.
The level of confidence associated with a confidence interval represents the likelihood that the true value of the population statistic is contained within the interval. For example, if a confidence interval has a 95% level of confidence, that means that if you were to repeat this process many times, 95% of the intervals you calculate would contain the true value of the population statistic.

Prompt: Give a response to the homework question, what is the definition of a confidence interval
ChatGPT response: A confidence interval is a range of values calculated from a sample of data that is used to estimate an unknown population parameter. It provides an interval of values that is believed to contain the true population parameter with a specified level of confidence. The level of confidence is expressed as a probability, usually expressed as a percentage, that the interval calculated from the sample data will contain the true population parameter. Confidence intervals take into account both the sample size and the variability of the data in the sample, and help to provide a measure of the uncertainty associated with the estimate of the population parameter.

Table 2. Example prompts provided to ChatGPT and resulting responses generated regarding the definition of a p-value.

Prompt: Give the definition of a p-value
ChatGPT response: A p-value is a statistical measure that represents the probability of obtaining a result as extreme as, or more extreme than, the observed result, under the assumption that the null hypothesis is true.
The null hypothesis is a statement about a population parameter, such as the mean or proportion, that we want to test using a sample of data. The p-value is calculated by comparing the observed test statistic, which measures the difference between the sample and the null hypothesis, to the distribution of the test statistic under the null hypothesis.
If the p-value is small (typically less than 0.05), it suggests that the observed result is unlikely to occur by chance if the null hypothesis is true, and we reject the null hypothesis in favor of an alternative hypothesis. On the other hand, if the p-value is large (greater than 0.05), it suggests that the observed result is consistent with the null hypothesis, and we fail to reject the null hypothesis.
In other words, the p-value provides a measure of the strength of evidence against the null hypothesis and helps us make decisions about whether to accept or reject the null hypothesis.

Prompt: Give the definition of a p-value using non-technical language
ChatGPT response: A p-value is a number that tells you how likely it is that your research finding happened by chance. It's like a measure of how surprising your results are. When you do a statistical test, you compare your actual results to what you'd expect if there was no effect or difference between groups. If your p-value is small (less than 0.05), it means that your results are unlikely to have occurred by chance and there may be a real effect or difference. If your p-value is large (greater than 0.05), it means that your results could have happened by chance, and there may not be a real effect or difference. So, a p-value helps you decide whether to accept or reject your research hypothesis.

Prompt: Give a short definition of a p-value
ChatGPT response: A p-value is a measure of the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true.

While each of the examples presented in Tables 1 and 2 asks ChatGPT to explain the same concept, the generated responses differ in length, style, and approach to the explanation. Because ChatGPT is trained on a large corpus of existing text data such as books, articles, and websites, inaccuracies written in these sources can be passed to ChatGPT responses. For example, a common misinterpretation of a frequentist confidence interval is to treat it as a probability statement about the true value of the population parameter, incorrectly treating the population parameter as a random variable rather than a fixed but unknown value. One sentence in ChatGPT's response to the second prompt ("The level of confidence associated with a confidence interval represents the likelihood that the true value of the population statistic is contained within the interval") somewhat alludes to this misinterpretation (Table 1). However, it does follow up with a correct interpretation in the following sentences. When defining a p-value, a common misinterpretation is that the p-value represents the probability of your research finding occurring by chance, a statement that ChatGPT provides in response to the second prompt (Table 2). This definition could be true if it was understood under the assumption of the null hypothesis being true, but ChatGPT does not provide this qualifier. Although there are inaccuracies in ChatGPT-generated responses, the responses in Tables 1 and 2 show that ChatGPT is capable of successfully defining these statistical concepts using a range of language styles (both technical and non-technical). It is important that statistics and data science educators teach students to recognize that the statements generated by ChatGPT may mix accurate and inaccurate content, and a fruitful classroom exercise may consist of guiding students in how to critique ChatGPT's responses.

Despite its wide-ranging capabilities, there are some types of tasks that ChatGPT may not perform well. For example, asking ChatGPT to generate references may lead to a reference list that is completely fabricated (King and ChatGPT 2023). ChatGPT also cannot accommodate non-text prompts such as images. Occasionally, ChatGPT responds very slowly, and sometimes it cannot be used at all, displaying a message such as, "We're experiencing exceptionally high demand. Please hang tight as we work on scaling our systems." ChatGPT lists its own limitations as: "may occasionally generate incorrect information", "may occasionally produce harmful instructions or biased content", and "limited knowledge of world and events after 2021" (OpenAI 2023). ChatGPT does not have common sense knowledge or the ability to understand context beyond what it has learned from its training data. Therefore, it may not always be able to provide accurate or appropriate responses to prompts that require a deep understanding of the context beyond the scope of its training data. It also may be unable to tailor instruction to an individual student as effectively as a human teacher. As with any AI model, it is essential to be cautious and critical of its outputs and use it as a tool rather than a definitive source of information.
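The repeated-sampling interpretation of a confidence interval, and the role of the null hypothesis in the definition of a p-value, can both be checked by simulation, which is one way students might critique responses like those in Tables 1 and 2. The Python sketch below is illustrative only: the population (normal with mean 170 and standard deviation 10), the sample size, the known-sigma z-interval, and all function names are assumptions chosen for simplicity and are not drawn from the ChatGPT responses.

```python
import random
from statistics import NormalDist, mean

Z = NormalDist()  # standard normal, used for quantiles and tail areas


def z_interval(sample, sigma, level=0.95):
    """Return a (lower, upper) z-interval for the mean, assuming sigma is known."""
    z_crit = Z.inv_cdf(0.5 + level / 2)
    half_width = z_crit * sigma / len(sample) ** 0.5
    center = mean(sample)
    return center - half_width, center + half_width


def two_sided_p_value(sample, mu0, sigma):
    """Return the two-sided z-test p-value for H0: population mean == mu0."""
    z_stat = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - Z.cdf(abs(z_stat)))


def simulate(n_reps=5000, n=30, mu=170.0, sigma=10.0, seed=1):
    """Repeat the sampling process many times.

    Returns (share of intervals containing mu,
             share of p-values below 0.05 when H0: mean == mu is true).
    """
    rng = random.Random(seed)
    covered = 0
    rejected = 0
    for _ in range(n_reps):
        # The parameter mu stays fixed; only the sample changes between repetitions.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma)
        covered += lower <= mu <= upper
        rejected += two_sided_p_value(sample, mu, sigma) < 0.05
    return covered / n_reps, rejected / n_reps


coverage, rejection_rate = simulate()
print(f"Share of 95% intervals containing the true mean: {coverage:.3f}")
print(f"Share of p-values below 0.05 when H0 is true: {rejection_rate:.3f}")
```

Under these assumptions, the first proportion should land near 0.95 and the second near 0.05. The randomness lies in the intervals and the p-values, not in the fixed population mean. Calling `z_interval` on a single sample with `level=0.95` and again with `level=0.99` gives a quick check of how interval width responds to the confidence level, which is the kind of claim in Table 1 that students can verify rather than accept.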

Leveraging ChatGPT for Statistics and Data Science Education
There are several ways that educators can consider leveraging ChatGPT in statistics and data science education. One of these methods is utilizing ChatGPT to generate course content, which could include supplements to educator-derived material, such as prompting ChatGPT to create a summary of lecture notes (Roose 2023; Staake 2023). It could also generate new material, such as practice quizzes or exam questions (Roose 2023; Staake 2023). For example, when we provided ChatGPT with the prompt, "Write three free-response exam questions about sample size and power for a college-level introductory statistics class," it generated the following:

Note that some ChatGPT responses are more appropriate than others, and we do not condone blindly using ChatGPT-derived materials without critically assessing whether these materials meet assessment goals for student learning. However, even in cases where ChatGPT responses are not meeting the educator's goals, an iterative process can be used to refine ChatGPT responses. For example, a prompt may be revised to ask for something that is shorter or longer, written in a specific style, or written without some part of the response that was undesirable. In our experience, ChatGPT performs well when provided with a specific prompt that is then refined by the user as needed, as opposed to a traditional search engine that may not perform well with an overly specific or verbose input.

Educators can also manually revise ChatGPT responses to meet their needs or leverage ChatGPT to generate general ideas that are then used for educator-developed materials. For example, when we provided ChatGPT with the prompt, "Give me an example of a biomedical variable that follows a Poisson distribution," it generated the following:

"A Poisson distribution is a probability distribution that can be used to model count data, such as the number of occurrences of an event in a given time or space. In a biomedical context, an example of a variable that can follow a Poisson distribution is the number of infections acquired by patients in a hospital. The number of infections can be modeled as a Poisson process, where the mean rate of infections depends on various factors such as the type of infection, the patient population, and the quality of healthcare. By modeling the number of infections as a Poisson variable, healthcare professionals can estimate the likelihood of different infection scenarios and implement measures to reduce the risk of infections in the hospital."

This response provides much more information than was requested. Still, if the educator's goal was to get an idea of a health-related count variable to use in an example about a Poisson distribution, the goal was achieved: they could consider using "number of infections" as the example. As previously mentioned, critically thinking about ChatGPT responses and manually revising as necessary is an important part of the process. For example, the educator may recognize (as ChatGPT failed to mention) that counts of hospital infections may be overdispersed and more appropriately modeled using a negative binomial distribution than a Poisson distribution.
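To make the overdispersion point concrete, the following Python sketch compares the variance-to-mean ratio of counts drawn from a single Poisson rate with counts whose rate varies from unit to unit (a gamma-Poisson mixture, which yields a negative binomial distribution). All numbers here, including the number of hypothetical ward-months, the mean rate, and the gamma shape, are invented for illustration and are not real infection data. The Poisson sampler is a simple textbook method assumed to be adequate for small rates.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson counts."""
    return variance(counts) / mean(counts)


rng = random.Random(2023)
n_wards = 5000    # hypothetical number of ward-months observed
mean_rate = 4.0   # hypothetical average number of infections per ward-month
shape = 2.0       # gamma shape; smaller values give more overdispersion

# Same underlying rate everywhere: plain Poisson counts.
poisson_counts = [poisson_draw(rng, mean_rate) for _ in range(n_wards)]

# Rate differs between wards (gamma-distributed with the same mean):
# the resulting counts follow a negative binomial distribution.
mixed_counts = [
    poisson_draw(rng, rng.gammavariate(shape, mean_rate / shape))
    for _ in range(n_wards)
]

print(f"Poisson variance/mean: {dispersion_ratio(poisson_counts):.2f}")
print(f"Gamma-Poisson variance/mean: {dispersion_ratio(mixed_counts):.2f}")
```

A ratio near 1 is what a Poisson model implies, while a ratio well above 1 signals overdispersion. With these assumed settings the mixture has theoretical variance `mean_rate + mean_rate**2 / shape`, so its ratio should be roughly 3. An educator could pair a check like this with the ChatGPT-suggested example to prompt discussion of when a Poisson model is reasonable.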
Statistics and data science educators may also find ChatGPT helpful for developing materials for courses focused on statistical programming or as a learning tool for students in programming courses. For example, an educator can ask ChatGPT to write pseudocode for a given problem, which then can be used as a teaching example to help students who are new to statistical programming. An educator could also provide an excerpt of code to ChatGPT and ask it to write the code in a different language; for example, to convert it from SAS to R or from R to Python. Statistics and data science educators may also wish to guide students in how to use ChatGPT to debug their code, a topic that has been covered by others (Jain 2022; Rollbar Editorial Team 2023). Of note, several other AI tools also exist to assist in code writing and debugging, and this area is rapidly evolving as new tools become available regularly (Phillips 2023).
Educators may also wish to test their assignments or exam questions on ChatGPT to be aware of what students could derive if they utilized ChatGPT to answer the questions themselves. Educators may find that ChatGPT is surprisingly adept at providing short answers to free-response questions that ask students to explain statistical concepts, such as, "Briefly explain one disadvantage of utilizing complete case analysis to analyze a study with missing data." As previously mentioned, there are other types of tasks that ChatGPT is less capable (or sometimes incapable) of performing well, such as interpreting statistical output that is provided as an image.
Tools exist to assess student learning while ensuring students are not using ChatGPT or other generative AI tools, such as programs blocking access to specific websites or the entire internet. However, we echo the sentiments of many others suggesting that educators consider an alternative approach to address how students can leverage generative AI to enhance learning directly (Roose 2023; Abramson 2023; Duckworth and Ungar 2023; Staake 2023). Doing so mimics the environment that students will be in outside of the classroom so that they can learn to responsibly use the tools that they will have available to them. For a free-response task, this could include a three-part question that asks students to: (a) answer a prompt in their own words, (b) feed the prompt into ChatGPT and copy the response, and (c) analyze how their response differs from ChatGPT's response and how they can assess whether the differences are "correct." For a statistical programming class, an example could include asking students to do the following: (a) prompt ChatGPT to write some R code to perform a specific task, (b) run the ChatGPT-generated code in R, and (c) assess whether the code worked as intended and explain how they determined whether or not it worked, or alternatively, explain line by line what the generated code is doing. Incorporating exercises that require students to critique ChatGPT provides instructors the opportunity to help students navigate responses generated by ChatGPT. Such exercises can be leveraged to create classroom discussions where students share their responses, and the instructor can facilitate conversations around responsibly using AI tools and how to critique the generated responses.
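As one possible shape for step (c) of the programming exercise, the sketch below shows how a student might test generated code against a trusted reference. It is written in Python rather than R purely for illustration. The function `generated_sd` is a hypothetical stand-in written for this example, not actual ChatGPT output, and it deliberately contains a common mistake so that the check has something to find. The helper name and test cases are likewise assumptions.

```python
import math
import statistics


def generated_sd(values):
    """Hypothetical stand-in for chatbot-generated code (not actual ChatGPT output)."""
    avg = sum(values) / len(values)
    squared_deviations = [(v - avg) ** 2 for v in values]
    # Divides by n, which gives the population SD rather than the sample SD.
    return math.sqrt(sum(squared_deviations) / len(values))


def check_against_reference(func, reference, cases, tol=1e-9):
    """Return the test cases on which func and reference disagree."""
    failures = []
    for case in cases:
        got, expected = func(case), reference(case)
        if not math.isclose(got, expected, rel_tol=tol, abs_tol=tol):
            failures.append((case, got, expected))
    return failures


cases = [
    [2, 4, 4, 4, 5, 5, 7, 9],  # small data set that can also be checked by hand
    [1.5, 2.5],
    [10, 10, 10],              # no variability, so any SD formula should give 0
]

# The task was a *sample* standard deviation, so statistics.stdev is the reference.
failures = check_against_reference(generated_sd, statistics.stdev, cases)
for case, got, expected in failures:
    print(f"{case}: generated code gave {got:.4f}, reference gave {expected:.4f}")
```

In this sketch the code runs without error, yet it disagrees with the reference on every case that has variability, which illustrates why "it ran" is not the same as "it worked as intended." The constant-data case passes under both formulas, a reminder that a single convenient test case can hide an error.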

Reflection
We recognize that educators have valid concerns regarding the implementation and integration of AI tools in the classroom, and a full discussion of these issues can be found elsewhere (Baron 2023; Cotton, Cotton, and Shipway 2023; Gilbard 2023; Hirsh-Pasek and Blinkoff 2023; Rudolph, Tan, and Tan 2023; Weissman 2023). In addition to the concerns related to the classroom use of AI, legal and ethical concerns have arisen with the introduction of ChatGPT (Karim 2023; Woodie 2023; Zhou et al. 2023). Despite these concerns, the popularity of AI tools is increasing. Many universities are establishing AI-related committees or utilizing existing ethics committees to provide guidance on issues of academic integrity as AI tools continue to evolve and advance (Grove 2022;.

We encourage readers to consider other technologies, such as the calculator, WolframAlpha, and Wikipedia, all of which were met with initial wariness but are now commonly used as learning tools. As statistics and data science educators, we can actively shape and guide the incorporation of AI tools within our classrooms. This article highlights several avenues for responsibly leveraging ChatGPT in statistics and data science education.

Finally, we invite the reader to consider that two paragraphs of this manuscript were generated by ChatGPT. Can you tell which ones? To some, this exercise may highlight how dangerous ChatGPT can be to academic integrity. On the other hand, this exercise demonstrates how ChatGPT can be utilized effectively by feeding it a specific prompt, thinking critically about whether the generated response meets the user's goals, and revising the prompt if necessary. (For those looking for the answer, we used ChatGPT to generate the first paragraph of the Introduction section and the first paragraph of the "What Is Generative AI and ChatGPT?" section using the following two prompts: (a) "Write a short paragraph in the style of academic writing that describes the initial reactions to the impact of the calculator on mathematics education," and (b) "In the style of academic writing, briefly explain to a layperson what generative AI and large language models are.") The responses from ChatGPT were minimally edited for flow, and references were added by the authors.

Data Availability Statement
Data sharing is not applicable to this article as no new data were created or analyzed in this study.