Grasping Psychological Evidence: Integrating Evidentiary Practices in Psychology Instruction

Abstract The spread of misinformation has underscored the importance of cultivating citizens’ competency to critically evaluate popular accounts of scientific evidence. Extending the prevailing emphasis on evidence in the natural sciences, we argue for fostering students’ understanding of psychological evidence and its communication in the media. In this study, we illustrate how this goal can be advanced in undergraduate psychology instruction by actively engaging students in evidence evaluation and design. We employed the Grasp of Evidence framework to document students’ evidence evaluation ideals and processes and how these changed over a course in which students engaged in a series of collaborative evidence evaluation and design tasks. Prior to instruction, students exhibited a mechanistic understanding of scientific methods, coupled with substantial reliance on personal experience. Following instruction, students demonstrated three key shifts in grasp of evidence: a shift in perception of the sources of psychological knowledge, a shift in views of scientific objectivity, and a shift in definitions of psychological phenomena. Analysis of students’ collaborative discourse highlighted three design principles that supported increasingly complex understandings of psychological evidence: collaborative critique and redesign of flawed studies, engagement with diverse sources of popular evidence, and confronting elusive conceptual constructs.


Introduction
The spread of misinformation and disinformation in the "post-truth" era has highlighted the importance of cultivating a critical and literate citizenry (Barzilai & Chinn, 2020; Britt et al., 2019; Kahne & Bowyer, 2017; Kienhues et al., 2020). An essential part of this effort is promoting citizens' competencies to evaluate accounts of evidence they encounter online, so that they can become "competent outsiders" who are able to critically engage, together with other people, with reports by experts (Feinstein, 2011; Feinstein & Waddington, 2020). However, evaluating scientific evidence is a considerable challenge for students and laypersons because of the complex, tentative, and social nature of evidence production and evaluation (Duncan et al., 2018; Kienhues et al., 2020).
To date, the majority of research on students' evidence evaluation practices has focused on evaluation of evidence from the natural sciences (e.g., McNeill & Berland, 2017; Samarapungavan, 2019), and relatively little attention has been given to how students engage with evidence from the social sciences in general, and psychological research specifically. This is problematic because such evidence has become a staple of popular accounts of human behavior and often informs public policy (e.g., Einfeld, 2019; Peterson, 2015). Moreover, evidence in psychological research has unique attributes compared to evidence in the natural sciences, as it is based on studying complex and unobservable everyday phenomena in ways that are amenable to empirical examination and replicable by other scientists (Bhattacherjee, 2012; Elmes et al., 2011). Attentiveness to these characteristics of psychological evidence is particularly challenging as students often fail to appreciate the scientific foundations of psychology (Amsel et al., 2014; Lyddy & Hughes, 2012) and struggle to coordinate their first-hand experiences with indicators of reliable scientific evidence (McCrudden et al., 2016; Thomm et al., 2021). However, we currently know little about how psychology students evaluate psychological evidence and how their evaluation competence develops over the course of their academic studies.
Introductory psychology courses provide a unique opportunity for promoting evidence evaluation practices. These courses are highly attended by diverse students, most of whom will not continue to advanced psychology degrees or academic careers. Thus, these courses have the potential to expose a large body of students to evidence evaluation practices that could help them in their future encounters with psychological evidence in their everyday lives. However, introductory psychology courses are typically content-centered and focus less on understanding how psychological evidence is produced and critiqued (Gurung et al., 2016; LaCosse et al., 2017). Accordingly, the aims of this study were to examine undergraduate psychology students' epistemic ideals and processes for evaluating popular psychological evidence, and to illuminate how these ideals and processes change during and following a course that provides opportunities for engagement in collaborative evidence evaluation and design tasks.
The paper starts by discussing the features of lay understanding of scientific evidence based on the Grasp of Evidence framework (Duncan et al., 2018) and the AIR model of epistemic thinking (Chinn et al., 2014; Chinn & Rinehart, 2016). We then describe the challenges of evaluating psychological evidence and the specific challenges of engaging students in evidence evaluation in large introductory psychology courses. After presenting the context of the study and the epistemological orientation of the course, we focus on exploring three key shifts in students' grasp of psychological evidence: shifts in grasp of sources of psychological knowledge, shifts in views of scientific objectivity, and shifts in definitions of psychological phenomena. We conclude by suggesting design principles for engaging students in disciplinary productive epistemic practices.

Developing a lay grasp of psychological evidence
Evidence evaluation poses unique challenges to students and laypersons because they typically lack the extensive disciplinary and methodological knowledge that is needed to understand and evaluate study designs and measures (Duncan et al., 2018; Keren, 2018). Thus, lay understanding of scientific evidence is inevitably bounded (Bromme & Goldman, 2014). In light of this challenge, Duncan et al. (2018) proposed a theoretical framework, called the Grasp of Evidence (GoE) framework, which unpacks the dimensions of evidence that are relevant for preparing students to engage with evidence as outsiders to science. Duncan et al. (2018) argue that to meaningfully engage with science, laypersons need to grasp two aspects of evidence. The first aspect is lay grasp of experts' evidentiary practices, that is, of how experts gather, understand, evaluate, interpret, and integrate evidence. The second aspect is grasp of laypersons' evidentiary practices, that is, of how laypersons can meaningfully use evidence despite their bounded understanding of science. This involves grasping the division of cognitive labor in society and knowing how to identify trustworthy sources of evidence and patterns of agreement in the expert community. Lay evidence evaluation practices rely on "second-order" evaluation of source trustworthiness and degree of consensus. For non-experts, such evaluation practices are usually more reliable than "first-order" evaluation of complex claims and evidence (Bromme & Goldman, 2014; Keren, 2018; Osborne et al., 2022). Nonetheless, understanding of expert practices can help laypersons know why and when scientific sources and reports can be trusted (Duncan et al., 2018; Kienhues et al., 2020).
The GoE framework identifies four dimensions of grasp of experts' evidentiary practices (evidence analysis, evidence evaluation, evidence interpretation, and evidence integration) and one dimension that focuses on lay use of evidence reports. Within each of these dimensions, Duncan et al. (2018) identified specific epistemic aims, epistemic ideals, and reliable epistemic processes for reasoning with and about evidence. These components are derived from the AIR model of epistemic thinking (Chinn et al., 2014; Chinn & Rinehart, 2016):
(a) Epistemic aims are the epistemic products or outcomes that people set out to achieve. For example, according to Duncan et al. (2018), the epistemic aim of expert evidence evaluation is determining if evidence is of high quality and whether conclusions can be trusted (e.g., if the conclusions follow from the results). In contrast, the aim of lay use of evidence is determining the credibility of scientific claims in everyday communication, that is, determining if scientific reports can be trusted.
(b) Epistemic ideals are the criteria or norms that are used to evaluate whether epistemic aims have been achieved and the quality of epistemic products. For example, some epistemic ideals of expert evidence evaluation are the appropriateness of the methods and measures and the validity of the conclusions. In contrast, lay evidence ideals can involve, for example, the trustworthiness of the sources that provide the reports (e.g., their degree of expertise and bias) and the validation of the reports by knowledgeable others (e.g., if the study has undergone expert peer review).
(c) Reliable epistemic processes are the procedures, strategies, or methods that people can use to successfully achieve epistemic aims. Reliable processes of expert evidence evaluation involve, for example, scrutinizing the appropriateness of the study design (e.g., appropriateness of samples and comparisons) and thinking of alternative explanations for the results. In contrast, reliable processes of lay use of evidence can involve, for example, engaging in critical evaluation of sources and publication platforms and determining if there is consensus in the scientific community or if the claims are controversial (Duncan et al., 2018).
Although Duncan et al. developed the GoE framework mainly in the context of the natural sciences and K-12 education, their conceptual analysis of the dimensions of grasp of evidence is also highly relevant to evidence in the social sciences and in higher education. Indeed, Duncan et al. aimed to describe lay reasoning about evidence at a "mid-level of domain-specificity" that "can generalize across genres of scientific activity" (Duncan et al., 2018, p. 913). Nevertheless, it is possible that disciplinary differences and increasing complexity could shape the specific aims, ideals, and processes that are valued in each dimension and their enactments. The GoE framework does "not claim to specify all aims, ideals, or reliable processes that are useful in a given dimension" (Duncan et al., 2018, p. 917). Rather, Duncan et al. provided examples of key aims, ideals, and processes and acknowledged that these might vary across domains. Further, the lay dimension of evidence use can reflect people's everyday goals, assumptions, and interactions with evidence in context as they pursue issues that matter to them. Hence, in this study, we use the GoE framework as an open-ended analytical tool to depict psychology undergraduates' understanding of evidence and to analyze its development.
We specifically focused on examining growth along two dimensions of the GoE framework, evidence evaluation and laypersons' use of evidence, as these unfolded along an introductory psychology course. We scrutinized the epistemic ideals and processes that students employed in each of these dimensions in order to clarify how students grappled with the aim of evaluating the credibility and quality of psychological evidence from both disciplinary and lay perspectives. Our analysis is grounded in the assumptions that epistemic practices, including methods and standards of gathering and evaluating evidence, are thoroughly social, developing within epistemic communities and dependent on the particulars of tasks and contexts (Longino, 1990; Chinn & Sandoval, 2018). In this sense, our documentation of the development of students' epistemic ideals and processes reflects their initiation into a particular community of psychological research. However, we believe that this analysis may also contribute to understanding of the development of grasp of psychological evidence more broadly.

The challenges of evaluating psychological evidence
As mentioned, prior research on evidence evaluation has mostly focused on the natural sciences. Yet, psychological evidence has come to play a rising role in public policy (e.g., Einfeld, 2019; Peterson, 2015) and its evaluation merits further inquiry. Evidence in psychological research has some specific attributes that pose challenges for students.
First, students often fail to appreciate, or even actively doubt, the scientific basis of psychology as the study of human behavior (Amsel et al., 2014; Güss & Bishop, 2019). Many students experience difficulties in accepting the relevance of rigorous scientific methods to the study of psychological phenomena compared to biological ones (Estes et al., 2003). This tension is accentuated by the fact that psychology students often perceive the scientific study of such phenomena as disconnected or even as a digression from the core clinical features of psychology as a professional practice (Balloo et al., 2018). In this respect, students may struggle to coordinate scientific principles for evidence evaluation, adopted by psychological research communities, and their personal knowledge about psychology, based on their accumulated personal experiences and popular conceptions (Holmes & Beins, 2009; Lyddy & Hughes, 2012).
Second, students may grapple with the sources of objectivity in psychological research. A common depiction of objectivity is that it is the negation of subjectivity; that is, that science is objective to the extent that it remains value-free (Douglas, 2007, 2009). However, researchers' preferences and values are intertwined in every step of the scientific process of conceptualizing, gathering, and interpreting evidence (Reiss & Sprenger, 2020). As Douglas argues, objectivity does not require rejecting values; rather, it arises from using trustworthy processes for interacting with the world, from reasoning about data in impartial ways, and from social processes of validation, agreement formation, and debate (Douglas, 2009). For instance, one important source of objectivity described by Douglas is convergent objectivity, which occurs when a phenomenon can be similarly described via different and independent methods. Objectivity can also arise from social processes such as procedural objectivity, which "occurs when a process is set up such that regardless of who is performing that process, the same outcome is always produced" (Douglas, 2007, p. 134). The importance of procedural objectivity has received increased attention with the emergence of the "replication crisis" in psychology, which has focused public attention on failures to reproduce research results (Maxwell et al., 2015; Shrout & Rodgers, 2018). This crisis has highlighted the importance of addressing the social sources of scientific objectivity that ensure that evidence will be "critically examined, restated, and reformulated before becoming part of the scientific canon" (Longino, 1990, p. 68). In this paper, we adopt a complex view of scientific objectivity that recognizes researchers' perspectives and values and that considers objectivity as arising from the practices that scientific communities adopt to improve the reliability of their inferences (Longino, 2019; Reiss & Sprenger, 2020).
Third, and relatedly, students often have a hard time appreciating how psychological knowledge is defined and measured. Compared to the natural sciences, measurement processes in psychology are often less direct (e.g., measuring stress via heart rate). Psychological research tackles this obstacle via an emphasis on the validity of psychological constructs, that is, on formulating operational definitions that are both amenable to empirical examination and replicable by other scientists (Elmes et al., 2011; Payne & Westermann, 2003; Vazire et al., 2022). Although the replication crisis initially led to a focus on faulty statistical practices, researchers have noted that failures to replicate often have to do with the validity of psychological constructs and measures (Credé et al., 2017; Vaidis & Bran, 2019; Yarkoni, 2020).
These challenges may be exacerbated by the ways in which psychological evidence is communicated in the media. The everyday relevance of psychological phenomena has led to the emergence of an industry of popular books, magazines, blogs, talks, and podcasts centered on communicating findings from psychological research (e.g., Duhigg, 2012; Epstein, 2019; Gladwell, 2006). These accounts typically simplify reports of evidence and present them as more clean-cut and applicable to everyday situations than the research findings they are based on. Moreover, when students read simplified and popularized scientific accounts, they may tend to underestimate their dependence on experts (Scharrer et al., 2017).
To conclude, learning to evaluate the credibility of psychological evidence involves multiple challenges. Yet, there have been very few studies so far that have examined the evaluation of popular communication of psychological evidence. Moreover, while researchers have examined how students approach the discipline as a whole (e.g., Amsel et al., 2014), these studies have not examined in detail students' own epistemic ideals and processes for evaluating psychological evidence, or how these develop over instruction.
Introductory psychology courses could provide an opportunity for cultivating epistemic practices such as evidence evaluation. However, such courses have traditionally prioritized lecture-based coverage of content knowledge in the various domains of psychology (for large groups of students) at the expense of active engagement with the scientific rationale underlying psychological research (Gurung & Hackathorn, 2018; Stevens et al., 2016). To address these challenges, researchers have stressed the importance of explicating the scientific foundations of psychological research (Brewer et al., 1993; Hughes et al., 2013). This entails highlighting and scrutinizing the scientific features of psychological studies and offering meaningful experiences that allow students to appreciate how research findings are underpinned by scientific research methods (Balloo et al., 2018; Gurung & Hackathorn, 2018). However, to the best of our knowledge, there is scant research into whether and how introductory psychology courses can serve as a site for fostering grasp of psychological evidence.

The present study
In light of this background, the objectives of this study were to examine first-year undergraduate students' epistemic ideals and processes for addressing the epistemic aim of evaluating popular psychological evidence and to investigate how these evolved during an introduction to psychology course designed to support students' evidence evaluation competence. We focused on an in-depth study of ideals and processes because the epistemic aims that students engaged in (e.g., judging evidence credibility) were largely set by the instructional context. Such an examination can shed light on students' grasp of psychological evidence and on the potential for growth in grasp of evidence following instruction. The questions we explored were:
(a) What are undergraduate students' pre-instructional epistemic ideals and processes for evaluating popular psychological evidence?
(b) Do these epistemic ideals and processes change following an introductory psychology course designed to foster evidence evaluation competence, and if so, how?
(c) How did epistemic ideals and processes emerge as students engaged in collaborative evidence evaluation and design tasks?
The first and second questions were addressed by analyzing and comparing students' responses to written evidence evaluation tasks that were administered before and after the course. The third question was examined through qualitative analysis of one group's discourse during a series of collaborative online meetings in which students were tasked with evaluating and designing psychological evidence.

Participants
The study was conducted in a first-year undergraduate Introduction to Psychology course in a large university in Israel. Out of the 316 students in the course, 265 students voluntarily consented to participate in the study. In our analyses, we focused on students who completed all of the course tasks and assessments (pre- and post-course evidence evaluation assessment, three small-group online tasks, final exam). In addition, we excluded students whose small-group tasks were compromised due to absenteeism on the part of other group members or technical problems. Thus, the final sample included 120 students (24 groups of 5 students) who completed all course assessments and whose group completed all tasks (70% female, 30% male). This sample was representative of the overall class in terms of gender (67% female, 33% male in the whole class) and in terms of their score on the final exam (83.26 in the whole class, and 83.13 in the sample).

Course design
To introduce students to the foundations of psychological science, the course offered a novel pedagogical structure for blended instruction, with three collaborative online evidence evaluation and design tasks, and ensuing face-to-face meetings centered around students' experiences in these tasks. Overall, the course consisted of eight video lectures covering the basic domains of psychology, four in-class meetings, and three synchronous online group tasks (Figure 1).
The course began with an in-class meeting in which students were introduced to the course structure and syllabus. Then, the course was conducted in three analogous cycles. First, students were assigned 2-3 video lectures, which they could watch at their own pace during a predetermined period of time (approximately three weeks). Then, students participated in 90-minute synchronous collaborative evidence evaluation and design tasks. In these tasks, groups of five students critically evaluated psychological evidence from popular sources (Table 1). While the course's core content was covered by the online lectures, these tasks and the ensuing in-class meetings focused on overarching features of psychological research that were chosen by Professor A (pseudonym, as are all names in the paper) in light of her research and teaching experience.

Synchronous online group tasks and in-class meetings
The aim of the group tasks was to actively engage students in epistemic practices of evidence evaluation and design. Collaborative engagement in such practices can support meaningful learning because it requires explicating one's thinking, accounting for multiple perspectives, and considering and weighing diverging points of view (Chinn et al., 2018; Kuhn et al., 2008; Ryu & Sandoval, 2015). Further, because epistemic practices are deeply social, the capacity to engage in epistemic performance together with other people is an important goal in and of itself (Barzilai & Chinn, 2018). The synchronous online tasks were conducted in the chat application on the Moodle platform that accompanied the course and were 90 minutes long. In these tasks, students were asked to evaluate accounts of psychological evidence and to formulate more scientifically rigorous ways to study these phenomena (Table 1). These two aspects, critiquing and constructing scientific claims, have been identified as the key building blocks of critical engagement with scientific evidence (Duncan et al., 2018; Ford, 2008). The innovative aspect of the current tasks was their requirement to redesign flawed psychological studies. This move was inspired by prior studies that have demonstrated that introducing low-quality sources and evidence can generate critical discussions of evidence quality that promote better grasp of evidence evaluation (Chinn et al., 2018; Rinehart et al., 2016). Complementing evidence evaluation with redesign activities was intended to enable even deeper engagement with evidence by inviting students to design and evaluate their own evidence. The challenge of formulating better methods for studying psychological phenomena was meant to expose the connections between epistemic processes (e.g., collecting data) and their guiding ideals (e.g., reliability).
In each task, students discussed a popular information source (a newspaper piece, a TED talk, and a podcast) about psychological research that was chosen to reflect particular challenges of the scientific study of psychological phenomena. The information sources and the subjects discussed (e.g., emotions, psychotherapy) were intended to reflect authentic psychological evidence students are likely to encounter online in their everyday life. This choice was guided by the assumption that the vast majority of students are not likely to pursue scientific careers, and therefore the course should aim to support a critical evaluation of popular communication of psychological science. The information sources were published a week before the task and the questions on these sources were uploaded to the course's Moodle at the beginning of the class time. Students then had 90 minutes to discuss the evidence and to upload a one-page collaborative response to the task at the end of the class.
Though the subject matter varied across tasks, their overall structure consisted of three main parts: (1) a question on the disciplinary content the sources engaged with (e.g., "What three types of therapy were introduced in the podcast?"); (2) evaluating the psychological evidence presented in the information source (e.g., "To what extent do you think the algorithm presented in the podcast measures emotions?"); (3) formulating better scientific ways to study the psychological phenomenon at hand (e.g., "Formulate an alternative study that explores the phenomenon studied in the original Little Albert experiment") (Table 1). Professor A and her teaching assistant monitored group work by analyzing overall participation patterns (via Moodle) and grading the one-page answers students handed in. In the week following each of these tasks, all students met for an in-person meeting. In these meetings, Professor A discussed the group task and the more general lessons stemming from students' work (the tasks are described in more detail in the findings of RQ3).

Epistemological orientation of the course and researcher positionality
The course's design was led by the course instructor, Professor A, who is an experienced psychology professor, with pedagogical support from the first author. Professor A defined the overarching aim of the course as introducing students to the scientific study of psychological phenomena. Professor A highlighted a quantitative and experimental approach to psychological research, focusing on the methods, procedures, and instruments that would allow for quantifying mental processes in order for them to be generalizable and predictive, while also limiting the scope of inquiry to what can be quantified. Although Professor A took a complex and critical stance toward these methods and repeatedly problematized them, she initiated students into ways of thinking reflective of a particular approach to cognitive psychology. Thus, the course, and this study, do not purport to represent the broad spectrum of psychological research and psychological evidence.
Professor A's premise was that many incoming psychology students view psychology mainly as a clinical profession, both in terms of their personal experiences and their career goals. Accordingly, she strived to expose students to the empirical study of psychology and the research methods that allow studying psychological phenomena in scientifically valid and reliable ways. Professor A approached validity and reliability from a social perspective and emphasized how these concepts are grounded in social processes of communication and critique. Professor A deliberately aimed to problematize students' assumptions regarding the sources of knowledge about psychological phenomena and to show that psychological science is not a simple task of identifying the essence of psychological phenomena. Instead, it is based on the accumulation of psychological evidence that achieves its trustworthiness through social processes of careful operational definition, precise and reflective measurement, replication by the community, and, most importantly, critique. Such processes include both critical engagement with research findings and theoretical views and scrutiny of the scientific methods themselves.
In terms of the authors' positionality, we are education researchers, broadly working within the learning sciences with shared interests in students' epistemic development and in learning environment design. Our aim was to examine how students' ideals and processes of evidence evaluation develop within the course's existing framing. Grounded in our social view of epistemic practices (Longino, 1990; Chinn & Sandoval, 2018), our quantitative and qualitative analyses are based on conceptualizing the phenomena in ways that render our analyses amenable to critique. In our quantitative analysis, we strived to offer codes that can be critically reviewed and consistently applied first within our research team (interrater reliability) and then by other researchers. In our qualitative analysis, we aimed to increase reliability through critical discussions of our interpretations, by responding to alternative interpretations raised by our readers, and by highlighting theoretical and contextual concerns that informed our interpretations.

Data sources
The data sources included an evidence evaluation assessment and students' written discussions in the synchronous online group tasks.

Evidence evaluation assessment
The evidence evaluation assessments were conducted in the first week of the course (pre-assessment) and in the week following the final class (post-assessment). Participants read three adapted and abbreviated popular reports that presented psychological evidence of varying quality (following Rinehart et al., 2016). These included a newspaper advice column, a news report of a scientific study, and an abstract from a scientific journal. Evidence quality ranged from low-quality anecdotal evidence to a detailed report of a well-designed study. The pre- and post-assessments addressed different popular psychological topics (pre: child rearing; post: sleep), but the reports were deliberately designed to have parallel source and evidence features and quality (see Table 2 and Appendix A). Participants were asked to explain whether the evidence was reliable and whether the conclusion was valid, and to provide reasons for their evaluations.

Coding of evidence evaluation assessment
The analysis of students' written responses focused on identifying epistemic ideals and processes of evidence evaluation in participants' justifications of their evaluations. The AIR model highlights the importance of epistemic aims, that is, the epistemic products or outcomes that people set out to achieve. However, in the context of the pre- and post-assessment, the epistemic aims were set by the task instructions. That is, the assessment positioned the aims of the task as determining the credibility of the evidence and the validity of the conclusions. Consequently, students typically did not mention epistemic aims in their responses. Critically, it was not possible to reliably glean students' aims on the basis of their responses and to determine whether they were motivated by epistemic aims and/or by other aims. Accordingly, our focus in this assessment was on analyzing the epistemic ideals and processes that students employed in response to the stated aims of the task.
To analyze students' epistemic ideals and processes of evidence evaluation, we iteratively developed a coding scheme that exhaustively captured students' evidence evaluation ideals and processes in the pre- and post-course evidence evaluation assessments. The coding scheme was developed both deductively and inductively: we relied on the Grasp of Evidence framework (Duncan et al., 2018) and used it to identify epistemic ideals and processes in students' responses. However, we also paid close attention to participants' voices to identify ideals and processes that were not mentioned in Duncan et al.'s (2018) review. In addition, we were sensitive to instances in which students' uses of ideals and processes subtly differed from the descriptions provided by Duncan and colleagues and attempted to capture these differences. For example, the GoE ideal of valid inference of conclusions (i.e., avoiding conclusions that go beyond what is warranted) was slightly modified in our own coding framework to valid generalization (i.e., avoiding conclusions that go beyond the contexts/populations studied). This new ideal can be understood as a sub-ideal, which reflects the participants' emphasis on generalization from a sample to a population. Arguably, students' focus on generalizations may reflect the importance of sampling in psychological research. We also identified ideals that were not described by Duncan et al. (2018). Some ideals, such as the ideal of clarity of definitions (i.e., defining the phenomenon and variables clearly and precisely) and objectivity (i.e., limiting the influence of researchers' subjective experiences, preferences, or values), appeared to reflect challenges that are more predominant in psychological research, although not exclusive to this domain. Other novel ideals, such as adequate empirical data, seemed to reflect the challenges of empirical research more broadly.
The coding scheme addressed two dimensions of evidence evaluation described by the Grasp of Evidence framework: (a) evidence evaluation, which covers the ideals and processes that scientists might use when evaluating evidence, and (b) laypersons' use of evidence, which covers the ideals and processes that people without specialized disciplinary knowledge might use when they encounter evidence in everyday communication. We assigned the ideals and processes to each dimension following Duncan et al.'s (2018) approach to mapping these ideals and processes, and in light of Duncan et al.'s definitions of the aim that each dimension addresses. Table 3 includes an overview of the aims of each dimension, as defined by Duncan et al. (2018), and the associated ideals and processes that were identified in our study. We also note, in Table 3, which of the ideals and processes that we identified are mostly similar to the ones already described by Duncan et al. (2018) and which ones are more novel. As can be seen from this analysis, all of the process codes and many of the ideal codes were similar to ones already identified by Duncan and colleagues. We suggest that this might be because the GoE describes lay reasoning about evidence at a "mid-level of domain-specificity" that can apply across domains. Nonetheless, the novel ideals identified in our study suggest that evaluating psychological evidence foregrounds some specific challenges, such as issues of population sampling, construct definition, and the role of researchers' perspectives.
Table 3 notes. a Ideals and processes that are mostly similar to the ones described by Duncan et al. (2018). b Ideals that were not originally described by Duncan et al. (2018). * Replication was mentioned by Duncan et al. (2018) in the context of evidence integration. However, in our study, students referred to replication as a process that can help evaluate specific scientific reports.
Our coding approach was not exclusive: when responses included mentions of multiple ideals and/or processes, we applied all of the relevant codes. Two coders (the first and third authors) coded 25% of the data, discussed differences in their coding, and refined code definitions. Codes that were not sufficiently distinct from each other, or that could not be coded consistently even after discussions, were removed. The coders continued by independently coding an additional 12.5% of the data. The resulting interrater reliability, measured with Cohen's kappa, exceeded 0.78 for all codes (M = 0.90, SD = 0.07), which indicates at least a substantial level of agreement (Landis & Koch, 1977).
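To make the reliability statistic concrete, the following is a minimal sketch of how Cohen's kappa could be computed for a single code from two coders' independent decisions. The function name and the example data are hypothetical and are not taken from the study; the sketch only illustrates the standard formula, kappa = (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(coder_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions for one code: 1 = code applied to the response, 0 = not applied.
coder_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

Because the coding was not exclusive, a calculation of this kind would presumably be repeated for each code separately, and the resulting values summarized (for example, by their mean and standard deviation).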

Analysis of discourse in the online collaborative tasks
In our analysis of the student discourse in the online collaborative tasks, we used an explanatory mixed-methods approach. The goal of our qualitative analyses was to elaborate and refine the findings that emerged in our quantitative analyses (Creswell & Plano Clark, 2011; McCrudden & Sparks, 2014). We first quantitatively identified codes in which there was a significant change between the pre-assessment and the post-assessment. These codes were organized under three main themes of change, described below. We then read the transcripts of all of the group tasks and identified episodes that shed light on the themes of change identified in the pre- and post-assessment. We then chose one group in which the themes were richly discussed and elaborated and in which students expressed diverse viewpoints. We conducted a fine-grained qualitative analysis of this group's discourse throughout the semester. Our aim was to examine how epistemic ideals and processes were reflected, expanded, and negotiated in group work. This analysis is intended to offer a more situated and nuanced understanding of students' evolving grasp of psychological evidence. Although our main focus was on epistemic ideals and processes, we also attended to students' epistemic aims when these emerged from the discourse and could help explain uptake of ideals and processes.

RQ1: Students' pre-instructional evidence evaluation ideals and processes
Examination of students' pre-instructional evidence evaluation performance revealed that they employed diverse ideals and processes. Table 4 summarizes the percentages of students who mentioned each ideal or process at least once when evaluating the three popular accounts of psychological evidence in the pre- and post-assessment. We start by discussing the evidence evaluation dimension and then turn to the laypersons' use of evidence dimension.
Evidence evaluation dimension: Pre-course epistemic ideals and processes. In their evidence trustworthiness evaluations, students frequently emphasized the ideal of conclusiveness (94% of the students), that is, whether conclusions were justified in light of the data presented and not undermined by alternative explanations (e.g., "The reason the researchers offer, that children will be afraid of challenging tasks in the future, is not based on evidence. It could be that getting too much praise led children not to make an effort and challenge themselves" [P16a]). A second highly frequent ideal was the availability of adequate empirical data (93%) (e.g., "The author does not rely on studies and empirical and objective data, but rather mainly on her personal experience" [P61a]). The high prevalence of these ideals indicates that even prior to their undergraduate studies, these first-year psychology students critically examined the relationship between data and conclusions and were sensitive to differences between scientific and anecdotal data.
The next ideals used by students reflect various dimensions of evidence quality and credibility. 44% of students commented on the clarity of definitions offered for the phenomenon, variables, and research measures (e.g., "There is no clear definition for the groups being studied or the concepts being used. How, for example, did they decide that certain children have low self-esteem?" [P44a]). 41% of students mentioned objectivity as an important evaluation criterion (e.g., "The decision whether a child has high self-esteem or not is, in my opinion, subjective. One researcher can watch the video and feel that the child has high self-esteem, and in contrast, another researcher will think otherwise" [P61a]). Students perceived "objectivity" as refraining from personal opinion or subjective judgment. 28% of the students noted whether researchers presented valid generalizations from the data (e.g., "In my opinion, the evidence is not credible, because the psychologist who wrote this is basing her arguments only on children she has met … therefore we cannot deduce that these conversations are representative of the population as a whole" [P12a]). Finally, only 2% of the students mentioned the ideal of valid methods, that is, whether the study indeed examined the phenomenon that it was intended to examine (e.g., "The experiment did not examine what it set out to examine, that is, it was not valid" [P12b]).
In terms of evidentiary processes, students most commonly discussed the importance of how the study was set up. 81% mentioned the importance of evaluating the appropriateness of study design, such as employing adequate comparisons and controls (e.g., "The evidence's credibility is rather high in my opinion because the researchers controlled for the variables" [P61a]), and 74% mentioned selecting adequate samples (e.g., "The researchers reached [a conclusion] from the results of a study that included a large number of participants that were compatible with the characteristics of the kind of subject required by the research question" [P61b]). Less attention was given to processes of establishing validity and reliability. Only 24% of the students commented on the need for replicating results (e.g., "Although the study was conducted properly, it is only one experiment. In order to validate and strengthen the results there is a need for replication" [P44b]), and only 13% mentioned using valid and reliable instruments (e.g., "I would want to know what instruments they used in order to measure and determine each child's self-esteem" [P49a]).
In sum, before the course, students were collectively aware of a wide range of ideals and processes for the evaluation of evidence. Specifically, they were attentive to broad scientific criteria, such as that conclusions need to be adequately justified by empirical data and that studies should involve sufficient samples and controlled comparisons. However, students had a generally lower awareness of ideals and processes that are foregrounded in psychological research, such as the ideals of valid methods and valid generalization, and the processes of using appropriate instruments and replicating results. While these ideals and processes are not unique to psychology, they are arguably predominant in psychological research, and highlighting their importance was a key aim of the course.
Lay use of evidence dimension: Pre-course epistemic ideals and processes. Students also used ideals and processes that are not based on specialized disciplinary knowledge and practices but rather on everyday resources that laypersons can employ to evaluate evidence. Participants often judged the personal coherence of the conclusion with their own experience and intuition (69% of the students). That is, they drew extensively on their personal knowledge and experiences to evaluate the evidence (e.g., "The conclusion is valid in my opinion because I agree with it in light of my personal intuition. I had similar experiences as a child, and I know that my home really helped me cope" [P16c]).
In addition to relying on their "firsthand" judgments (Bromme & Goldman, 2014), students also appealed to a series of "secondhand" evaluation ideals, which focus on source trustworthiness. First, students mentioned the importance of source expertise (41%) (e.g., "In my opinion, the evidence is highly credible because it is mentioned that the author is a clinical psychologist" [P16b]). Beyond the source's expertise, students also noted, albeit somewhat less frequently (29%), the ideal of outlet credibility (e.g., "It was published in a scientific journal which probably requires that authors offer substantiated proofs for their arguments" [P61b]). Finally, students infrequently commented (11%) on the ideal of source benevolence and integrity (e.g., "The evidence seems credible to me because I do not see any other [vested] interest behind this piece that could result in bias" [P41a]).
The most frequent lay epistemic process used by students was critically reading popular science reports (81%). Students criticized lack of clarity and detail in the reporting of the studies (e.g., "The description of the results is not credible enough in my opinion, because it is very general, without details on the number of trials or time dedicated to each trial in each group" [P10a]). Communicative quality was often used to infer authors' or researchers' credibility (Goldman, 2001); therefore, we considered this a predominantly lay evaluation process. However, reporting clarity and detail also serves the evaluation of methods. In addition, students exhibited awareness of the need for evaluating the source (53%) and not only the content (e.g., "This piece was published in a news site about a paper in a scientific journal. In my opinion, you cannot totally trust news sites" [P61a]; "The conclusion is presented by a clinical psychologist … " [P44s]). Taken together, these results indicate that students' lay evidence evaluation tended to rely more on firsthand evaluation of the validity of the findings, based on coherence with their prior knowledge and experiences, although they already exhibited awareness of processes of secondhand evaluation of source trustworthiness (e.g., social vetting procedures to determine evidence quality).

RQ2: Changes following instruction
Because the data violated assumptions of normality, we used McNemar tests to examine changes in the percentages of students who mentioned each epistemic ideal and process following the course. The frequencies of responses and statistical test results are provided in Table 4. We discuss the changes that emerged in light of three central themes that we identified in the data: (I) Shifts in sources of knowledge, toward coordination of personal and disciplinary knowledge; (II) Shifts in views of objectivity, toward procedural objectivity; (III) Shifts in definitions of psychological phenomena, toward operational definitions.
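As an illustration of the test named above, the following minimal sketch computes an exact (binomial) McNemar test for one code, comparing whether each student mentioned it before and after the course. The source does not state which variant of the test was used, so the exact two-sided version is an assumption here, and the function name and data are hypothetical.

```python
from math import comb


def mcnemar_exact(pre, post):
    """Exact two-sided McNemar test on paired yes/no observations.

    pre[i] and post[i] indicate whether student i mentioned the code
    in the pre- and post-assessment, respectively.
    Returns (dropped, gained, p_value).
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must cover the same students.")
    # Only discordant pairs (students whose answer changed) inform the test.
    dropped = sum(1 for a, b in zip(pre, post) if a and not b)
    gained = sum(1 for a, b in zip(pre, post) if b and not a)
    n = dropped + gained
    if n == 0:
        return dropped, gained, 1.0
    # Under the null hypothesis, each change is equally likely in either direction.
    k = min(dropped, gained)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return dropped, gained, min(1.0, 2 * tail)


# Hypothetical data for 20 students: 1 stopped mentioning the code, 9 started,
# 5 mentioned it both times, and 5 never mentioned it.
pre = [True] * 1 + [False] * 9 + [True] * 5 + [False] * 5
post = [False] * 1 + [True] * 9 + [True] * 5 + [False] * 5
dropped, gained, p_value = mcnemar_exact(pre, post)
print(dropped, gained, round(p_value, 4))
```

The test uses only the students whose responses changed between assessments, which is why it suits paired, dichotomous data such as mentioned versus not mentioned.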
Theme I: Shifts in sources of knowledge, toward coordination of personal and disciplinary knowledge. Following instruction, there was a significant decrease in students' reliance on the ideal of personal coherence with experience and intuition to evaluate evidence (from 69% to 52%). Students tended to rely less on their preexisting knowledge and experience to assess the evidence (e.g., "I think the conclusion is justified because I am familiar with the subject, and I know that these are things that promote the quality of sleep" [P50y]), and put more of an emphasis on evaluating the methodology of the studies. The decrease in personal coherence may reflect a growing recognition of the limitations of lay knowledge and a greater sensitivity to the division of cognitive labor between experts and laypersons (Bromme et al., 2010; Keren, 2018). That is, students demonstrated an understanding that lay knowledge might not be a sufficient basis for evaluating expert evidence. This decrease could also be due to increased disciplinary skills for evaluating evidence, which may have taken the place of reliance on personal judgments. Nonetheless, after the course, many students continued to rely on personal coherence when evaluating evidence. This suggests that students find it helpful to coordinate personal knowledge and disciplinary knowledge in the evaluation of psychological evidence. For example, one student wrote in the post-assessment:
First, this piece was published in a newspaper and does not introduce any quantitative measures, but rather "advice," which is typical of magazine columns. Therefore, I don't view the conclusion as credible. However, I agree with some of the "advice" in light of my own personal experience, and hence I partially agree with the authors' conclusion. [P24t]
In this example, the student appears to draw on both disciplinary knowledge about psychological measures and their personal knowledge.
Theme II: Shifts in views of objectivity, toward procedural objectivity. Following instruction, there was a significant decrease in appeals to the ideal of objectivity (from 41% to 17%), accompanied by increases in use of the processes of replicating results (from 24% to 55%) and using valid and reliable instruments (from 13% to 23%). Taken together, we suggest that these shifts reflect changes in students' understanding of objectivity. Pre-instruction, students leaned toward a view that objectivity is achieved when scientific inquiry is unaffected by researchers' subjective judgments, preferences, and perspectives (e.g., "[The conclusion] is partially valid; it does not stem from objective study of the phenomenon, but from the author's personal experience and subjective point of view" [P12ss]; "Although the information sounds credible and intuitively correct, all the information and conclusions were based on [the researchers'] subjective opinion, and hence the conclusion is not valid" [P44al]). This focus on objectivity, which was not identified in the original description of the GoE framework (Duncan et al., 2018), may reflect a key difference between the natural and social sciences. In the social domains, researchers' perspectives and interpretations are more salient to students than in science domains and can be perceived as an obstacle to establishing reliable knowledge (Kuhn, 2010; Thomm et al., 2017). As Kuhn (2010, p. 822) has noted: "In the social domain, then, the major challenge is to conquer the view that human interpretation plays an unmanageable, overpowering role. In the science domain, the major challenge is to recognize that human interpretation plays any role at all." We suggest that the decrease in mentions of "objectivity" (in the subjectivity-free sense) following the course reflects students' growing awareness that interpretation is integral to developing theories and producing evidence.
Following instruction, students appeared to develop a more procedural view of objectivity (Douglas, 2007). The shift is reflected in students' increased mentions of measurement processes. The process of using valid and reliable instruments focused on the accuracy and the reliability of the measures used by the researchers. Note how the same participant (P44al) now highlighted the importance of valid and reliable instruments: "The research instrument solely checked motor activity, but learning consists of other aspects as well. Hence, this is not a valid instrument." Thus, students more frequently argued that by employing valid and reliable instruments, researchers can measure the phenomenon in a more objective manner. (As we show below in our analyses of students' collaborative discourse, students were also aware of the limitations of this approach.) Students' shifting view of objectivity was also reflected in growing attention to the process of replicating results by other members of the scientific community. As mentioned, the course highlighted the social nature of science, including efforts to promote objectivity by showing that different researchers can achieve the same results. Again, we can see how the same student (P12ss) who highlighted the role of objectivity in the pre-assessment appeals to different criteria in the post-assessment: "The conclusion is not valid in my opinion, as this is only one study (and we don't know its breadth or the number of repetitions of the experimental protocol), without presenting other studies that support this conclusion." The attention to replication reflects growing awareness that objectivity in science is established through social processes (Kienhues et al., 2020; Longino, 1990). Taken together, these shifts reflect a view according to which scientific conclusions are objective not because researchers are free of personal judgments, preferences, or values, but due to the reliability and replicability of the procedures they employ.
Theme III: Shifts in definitions of psychological phenomena, toward operational definitions. The third shift involved changing views of the conceptualization of psychological phenomena. Specifically, students focused less on the nature of psychological phenomena and more on how these phenomena can be studied empirically. This shift is reflected in a decrease in the clarity of definitions ideal (from 44% to 18%), together with an increase in mentions of valid methods (from 2% to 38%). The clarity of definitions ideal refers to the conceptual definition of psychological phenomena and research questions in a precise manner (e.g., "it is unclear how the researchers define 'low self-esteem'. We do not have a way to know if the definition was correct and precise." [P36l]). The decrease in mentions of clear definitions occurred in parallel to a sharp rise in appeals to valid methods. The ideal of valid methods highlights the challenge of choosing methods and measures that capture the intended phenomenon. Here, the emphasis shifts from focusing on defining a phenomenon per se (e.g., What is self-esteem?) to focusing on finding methods to study this phenomenon empirically (e.g., "the conclusion is valid because they wanted to examine whether sleep influences learning and this is indeed what they examined" [P61d]; "I think the conclusion is valid because it refers to the findings that the researchers set out to examine. That is, they examined whether memories of activities are coded during sleep declaratively, and they did so by studying people who have problems with their declarative memory" [P12sa]). We suggest that these changes reflect a shift in students' thinking about the interplay between evidence and phenomena. The emphasis on clearly defining the phenomena at hand reflects a view that the onus is on scientists to precisely conceptualize the phenomena they study, whereas the focus on valid methods implies that the central challenge is finding valid ways to empirically study and measure these phenomena. Thus, the focus shifts to operational definitions that describe phenomena in measurable terms.

RQ3: Changes in evidence evaluation reflected in online tasks
We now turn to discuss the various ways in which students' understanding of the nature of psychological evidence emerged as they engaged in collaborative evidence evaluation and design tasks. The collaborative context required students to explicate their understandings of psychological evidence to group members in order to reach a shared understanding that would be a basis for their task product. Consequently, examining group discourse can support a more refined and situated description of the epistemic ideals and processes underlying students' engagement with psychological evidence, as well as of the aims that guided and shaped them. Our analyses examine all three themes of change described above. However, two of the themes (shifts in views of objectivity and shifts in definitions of psychological phenomena) were more prominent in the group discourse and are hence elaborated more.
We follow the dialogues of one group throughout the semester. The focus on one group is intended to facilitate an in-depth analysis of the trajectories of students' epistemic practices over time and across tasks. We chose this specific group as its discourse was rich and exhibited increasing levels of complexity with respect to the above themes. The group included five students, three men and two women; all participants successfully completed all course assignments and passed the final test.
Task 1: In which a flawed experiment sparks a search for objectivity and validity. In the first task, students read a newspaper article about the infamous "Little Albert" experiment conducted by John Watson and Rosalie Rayner. The article described recent revelations concerning ethical problems with the experiment, beyond those already identified in the literature (i.e., raising questions concerning Albert's health and casting doubt on his mother's consent).
In the task, students were asked to identify the experiment's methodological shortcomings (i.e., a sample of one subject, inconsistent exposure to stimuli, unclear measures) and to formulate alternative ways for studying the phenomenon examined in the original experiment (classical conditioning). The goal of the task was to introduce students to the rationale underlying design decisions, as well as to the challenges of translating a research question into an appropriate study design that can produce conclusive evidence.
The lesson following the task focused on teasing out ethical and methodological issues with the experiment. Professor A surveyed both the ethical problems with the original experiment (conducting conditioning studies on infants and the lack of meaningful consent) and the important developments in ethical standards and practices of psychological research over time. Here the emphasis was on the underlying rationale of ethical guidelines and the ways in which the scientific community has shifted toward prioritizing participants' well-being and active consent. In addition, Professor A emphasized that the validity of scientific theories does not rest on one experiment, but rather on the social and accumulated effort to replicate and refute existing results.
Stage 1: Seeking objectivity. Because the original experiment had only one subject and was not clearly structured, students initially focused on the epistemic processes of selecting adequate samples and using an appropriate research design. This was in line with students' familiarity with these processes in the pre-course assessment (Table 4). Consider the group's initial response to the prompt to find better ways to study the phenomenon:¹
As can be seen, the group initially focused on addressing the original experiment's sample (turns 187-188). Then, Nora, prompted by the experiment's faulty design, started thinking about how to address ethical issues ("healthy subjects") and to improve its reliability (turn 189). Her initial attempt was based on an appeal to procedural objectivity (Douglas, 2007): a rigid protocol that is uniformly applied ("being consistent rather than changing the experiment's procedure arbitrarily"). Such consistency can help ensure that regardless of who is performing the experiment, the result will be the same.
¹ As group discourse was conducted in the Moodle chat application, turns of talk were sometimes very brief. For the sake of clarity, we combined consecutive turns of talk by the same speaker and deleted turns that referenced unrelated parts of the conversation.
Stage 2: Questioning validity. While the group began by focusing on the study's objectivity, the request to redesign a flawed study led them to engage with underlying questions concerning the interplay between the experiment's design and the psychological phenomenon it aims to explore: that is, to question whether the methods uphold the ideal of validity.
The group started out by discussing at length how to create ethical alternatives for the original experiment. While certain group members argued that they should offer an alternative experiment with adult participants, others questioned whether conditioning can be studied on older participants in light of their earlier experiences. The group decided to conduct an experiment with babies, but to offer them positive reinforcements to a neutral stimulus, thus avoiding some of the ethical problems with Watson's study.² These alternatives, however, led to a debate about what exactly is the phenomenon studied in the original experiment:
Notice how Ruby (221) picked up Michael's critique of her earlier emphasis on fear conditioning and positioned it with respect to a broader view of the underlying theoretical proposition. Thus, the need to analyze flawed evidence, as well as to come up with an alternative design, led Michael and Ruby toward pursuing greater definitional clarity prior to engaging with the actual study design. After the group agreed that the aim of the experiment was to examine Watson's theory of classical conditioning, rather than fear conditioning specifically, they proceeded to search for a valid study design that would allow them to draw conclusions concerning the phenomenon of conditioning. Critically, they tried to do so while avoiding the ethical pitfalls of the original experiment. To this end, the group chose to replace fear conditioning with an experimental design based on introducing a pleasant stimulus:
The group attempted to formulate a research design that would elicit a response that was different from what would be normally expected, such as a happy response to a neutral stimulus. This move reflects a shift to concern with the validity of the methods, that is, whether these methods accurately capture the phenomenon under investigation. Yet, at this stage of the group's work, the phenomenon itself was not problematized; that is, it was assumed to be stable and observable. Hence, validity was achieved by finding ways to manipulate participants' behavior in order to reveal the underlying psychological phenomenon.
² As one of the reviewers of this paper commented, the very notion of examining operational conditioning with infants is ethically questionable. As mentioned above, this issue was discussed in the ensuing lesson. At the same time, it is important to note that the discussion of the Little Albert experiment led to deep engagement with and attentiveness to ethical issues.
Stage 3: Finding ways to increase validity. Ongoing concerns about validity (are we measuring what we are supposed to measure?) subsequently led the group to start arguing about what a "neutral stimulus" might be:
The group subsequently discussed the ethical problems with exposing babies to a snake and decided to employ a safer stimulus. In addition to these ethical dilemmas, the discussion surrounding the question of how babies would react to such a stimulus brought out two key aspects of students' epistemic practices. The first aspect related to the coordination of personal and disciplinary knowledge. In the group discussions, students rarely directly appealed to personal knowledge in order to judge or construct psychological evidence. The tasks appeared to encourage students to draw on disciplinary knowledge that was introduced in the course (e.g., "stimulus," "conditioning"). However, students implicitly relied on personal knowledge as they thought through the studies. For instance, in the above excerpt, the question concerning babies' natural reactions to animals (turns 306-309) is based (as far as we can tell) on students' personal knowledge about babies. This personal knowledge enabled them to make conjectures about the feasibility of the study's design. This suggests that personal knowledge took on an auxiliary role in these discussions: it facilitated a disciplinary discussion concerning how to develop a design that would enable generalization of the phenomenon. Put differently, although personal knowledge was not considered in and of itself as a legitimate source for the justification of claims, it nonetheless helped students engage in disciplinary practices for establishing claims in coordination with their growing disciplinary knowledge. At the same time, we highlight that even in the first task, students strongly relied on disciplinary knowledge introduced in the course.
This leads us to the second aspect: creating connections between epistemic processes and epistemic ideals. Michael shifted the focus of the discussion by connecting their study design (examining babies' reaction to a rat or snake) to the underlying behaviorist theory, arguing that lack of fear is a behaviorist conjecture (turn 309). In other words, Michael connected the study design to the ideal of valid methods, raising questions about what could be learned from their current experimental setup. Critically, this more complex understanding advanced the group's work, leading Ruby to question the experiment's validity for studying conditioning (turn 310). This led the group to the realization that they first needed to establish the baseline response to the stimulus in order to ascertain the validity of the experiment:

306 Lucas 09:28 The first step in the experiment is to examine the reaction to a neutral stimulus = rat/snake. We will expect that the baby will be disinterested. The first step is to examine how the babies respond in the first place. We can then suggest that in the next step we would only study those babies that did not have a special reaction/were disinterested.
324 Michael 09:32 I think we need to study all the babies, otherwise we are affecting the experiment's results. We can then retroactively argue that these babies were not relevant to begin with.
What is interesting about this excerpt is the manner in which the requirement to redesign the study led group members to engage with questions of validity, which is conceptualized as the capacity to offer an appropriate research design that facilitates arriving at conclusive results with respect to the phenomenon at hand. This is reflected in Michael's assertion that the experiment should examine whether human beings are naturally afraid of snakes (turn 322). That is, the study should reflect the arguments of the theoretical position it is based on. This was accounted for in their study design when Lucas offered to add an initial step in which they examine the babies' baseline reaction to the stimuli (turn 323). In other words, the group redesigned the study in order to increase the methods' validity, that is, whether they appropriately represented the phenomenon they wished to study.
To conclude, we suggest that this task demonstrates how the request to redesign a flawed study can lead students to move beyond an emphasis on an appropriate design (e.g., a control group and repeated measurements) to grappling with the epistemic ideals that underlie design choices: whether the methods enable drawing reliable and valid conclusions concerning psychological phenomena. The group's process of better defining the phenomenon they wished to study and then examining whether their design actually represented this phenomenon was achieved through challenging the meaning of constructs and stimuli suggested by other group members. This led to a problematization of design choices, which in turn drew attention to the ideals that underlie these choices. In the next task, the group moved beyond the issue of study design to engage more deeply with the difficulties of measuring psychological phenomena.

Task 2: in which an elusive psychological phenomenon problematizes notions of validity and objectivity
In the second task, students watched Rana El Kaliouby's (2015) TED talk, in which she argues that she has developed an algorithm that can identify human emotions via analyses of facial expressions. Students were asked to evaluate this claim and to offer ways to scientifically study emotions. Here again, students encountered flawed evidence, as the TED talk both lacked a definition of what emotions are (which led to overlooking important aspects, such as individuals' sense of these emotions) and did not adequately account for the relations between facial expressions and the emotions they are supposed to reveal (e.g., by offering alternative measures that could validate the interpretation of facial expressions). Professor A's goal in this task was to illustrate how the scientific study of psychological phenomena relies on identifying measures that allow generating quantitative data concerning complex and subjective constructs (such as emotions) in a valid and reliable manner. In the ensuing class, Professor A discussed the history of the study of emotions in psychology, from William James to recent studies. This description focused on how developments in the study of emotions were based on identifying lacunas in existing theories and measures and designing experiments that aimed to critically examine the accuracy of existing definitions and offer better ones.
Stage 1: What counts as objective evidence? Although the task prompt was quite broad ("How could we study emotions scientifically?"), the group immediately focused on the process of using valid and reliable instruments that would allow measuring emotions:
[...] experience of the person with respect to their feelings, because this expression of emotions is also significant.
Two competing notions of objectivity surfaced in this short excerpt. First, the idea surfaced that objectivity is tied to quantifiable and impersonal measures: Ruby asserted that to "scientifically" study emotions, the group "needs" to rely on "scientific and biological processes that can be measured empirically" (turn 170). Second, to complement the biological measures, Ruby suggested using questionnaires that are intended to probe the "objective sense of one's emotions" (turn 172). Ruby's use of the word "objective" here initially seems to be a slip of the tongue, as she referred to "expression of emotions" (later in the discussion she described these as "subjective"). However, although Ruby's word choice seems to reaffirm the dichotomy between objectivity and subjectivity, she argued for the "significance" of personal experiences and emotions. Considering her intent, Ruby's slip of the tongue takes on a new meaning: by calling reports of emotions an "objective" source of evidence, she turned the objective/subjective dichotomy on its head and disrupted the assumption that objectivity requires ignoring personal emotions and experiences. At the same time, Ruby prioritized quantifiable measures as a means for "objectively" measuring emotions rather than a more up-close exploration of emotions.
In response to Ruby's emphasis on the scientific study of emotions, Lucas added that "We also need to write something about statistics" (turn 171). Lucas's interjection suggests that he is more concerned with the non-epistemic aim of "meeting task requirements" than with the epistemic aim of "designing a good study." Nonetheless, the perceived requirements of the task led Lucas to introduce another dimension of scientific objectivity: presenting evidence to others in a clear and accessible manner, in accordance with communication conventions and norms. Reporting evidence in such a manner makes it more amenable to public scrutiny and replication.
Stage 2: Confronting limitations. The group members immediately challenged the validity of the instruments put forward by Ruby: Dan raised an objection to the validity of "brain wave" measures, arguing that they can merely identify the regions charged with emotions, rather than specific emotions (turns 175, 177, 180). While initially trying to defend her position (turn 176), Ruby suggested a quick and easy compromise based on a vaguer term, "brain activity" (turn 178). However, Dan was not satisfied and continued to argue that "there isn't a precise understanding of how emotions are expressed in patterns of electrical activity in the brain." Here Dan offered a fundamental critique of the very idea that biological measures can validly measure emotions. Michael offered another criticism of the validity of questionnaires, arguing that they do not measure emotions per se, but rather one's self-perception of them (turn 179). That is, these questionnaires are not just a poor proxy for emotions; they measure a different construct altogether. This comment reflects Michael's awareness that determinations of validity are entangled with issues of construct definition.
There isn't a precise understanding of how emotions are expressed in patterns of electrical activity in the brain. It's not like there is some sort of measure that can be identified and then we can simply say "the person is sad/happy" and so on.
This back and forth reflects engagement with the complexity of identifying valid methods to study complex psychological phenomena. The group faced the fundamental problem of measuring psychological phenomena; as Dan put it: "It's not like there is some sort of measure that can be identified and then we can simply say 'the person is sad/happy' and so on" (180). In this respect, the task design, which called for a "scientific study" of an elusive psychological phenomenon, invited students to problematize the notion of validity and to confront its boundaries. This discussion also demonstrates how charging students with defining epistemic processes of evidence collection can lead to deep engagement with underlying ideals, such as conceptual clarity and validity, and the interactions among them. At the same time, it is important to note that the group did not engage deeply with Dan's and Michael's critiques of the inherent limitations of available measures of emotions. The press for time, and students' preoccupation with the aim of "meeting task requirements" by the end of the session, led them to push on to find a solution for the problem.
Stage 3: Validity, objectivity and the interplay of converging measures. The group suggested overcoming these challenges of clarity and validity by integrating multiple instruments that can complement each other, a theme they continued to develop. Their initial solution involved using and comparing different measures (turn 184). Here the group seemed to appeal to convergent objectivity: "when epistemically independent methodologies produce the same answer, our confidence in the objectivity … of the result increases" (Douglas, 2007, p. 133). Though they acknowledged the limitations of each instrument, they suggested that by combining them they could bolster the study's objectivity. However, triangulating measures could potentially also help to ascertain the validity of the measurement of the phenomenon. Interestingly, Michael challenged this notion by arguing that combining measures would merely validate the questionnaires themselves, rather than actually measuring emotions (turn 186). However, it seems that group members did not fully grasp the implications of Michael's criticism, and reiterated the value of using and comparing different instruments (turns 187-188).
While this solution could be seen as a missed opportunity to interrogate and explore the uncertainties of measures of emotion, it nonetheless represents a shift from students' approaches to the first task. Whereas in the first task, students addressed validity mainly by ascertaining the relations between the phenomenon and the study design, here the emphasis was on combining multiple instruments to overcome the challenges of validly and reliably measuring an elusive psychological phenomenon. The group reached an impasse in their efforts to agree on adequate research instruments because the validity of all of the measures is fundamentally uncertain. This led them to focus on the relations between different research instruments as a path for improving both the study's validity and objectivity. Nonetheless, as is evident in the last excerpt, the students focused almost exclusively on how to measure emotions without defining the phenomenon itself. This challenge was the focus of the third task, in which the group started formulating operational definitions.
We should note that you need to use a variety of instruments and draw conclusions on the basis of the data collected from all the instruments rather than relying on just one instrument.

Task 3: in which an ill-defined construct creates a need for operational definitions
In the final task, students listened to a podcast (Invisibilia, 2015) that described different approaches to psychotherapy: psychodynamic, cognitive behavioral therapy (CBT), and mindfulness. Students were then asked to devise a scientific way to examine an argument put forward by one of the interviewees in the podcast: that psychodynamic therapy is "deeper" than the other two forms of therapy because it deals with the meaning and origins of negative thoughts. Here students encountered a proposition put forward in a popular account that was not substantiated on the basis of empirical evidence. Instead, in the podcast, the interviewee aimed to support the argument concerning depth by characterizing the different level of engagement with the sources of undesirable thoughts. Students were then challenged to devise ways to empirically test this analytical assertion. This task was intentionally structured around an ambiguous concept, as it aimed to highlight the importance of developing clear operational definitions of psychological phenomena (an issue many groups sidestepped in the second task, as seen above). As will be demonstrated, this led students to grapple with the question of how to examine "depth" in a valid and reliable manner. In the ensuing class discussion, Professor A revisited these themes: she discussed how different approaches to psychotherapy rely on different definitions of the aims of this process and explained that in order to compare such approaches empirically, researchers must make difficult choices concerning how to define what "successful therapy" is. These definitions are then open both to conceptual critiques and to the challenge of empirical replication across samples, designs, and contexts.
Stage 1: Difficulties defining the phenomenon at hand. The discussion started with a disagreement: Ruby stated that she disagreed with the claim that treatment that deals with the significance and origins of thought is a deeper treatment, because what matters the most is patients' well-being, and different patients may benefit from different types of treatment. Michael, in contrast, reaffirmed the psychodynamic perspective that investigating the sources of the thoughts is a deeper solution. The group was divided between Ruby's and Michael's positions and discussed the issue at length, until Ruby proposed: "Let's agree to disagree. There is no unequivocal answer" (turn 174). Lucas, as usual, urged the group to push forward and to start planning the experiment. Ruby responded by suggesting an experimental design to examine which therapy is "deeper," yet Michael immediately interjected, insisting that they still needed to define what "depth" means (turn 211).
Note here that Michael made a strong connection between the need to offer a clear definition of depth and the effort to construct a valid study, an argument that received support from Lucas. In contrast to the previous task, in which Michael's attempt to problematize the definition of [...]
Stage 2: Exploring relations between operational definitions, validity and objectivity. Ruby problematized the definition of "depth" (turn 223), stating that it is hard to come up with a non-controversial definition. This led the group to offer various competing definitions.
Here the group engaged in discussions concerning the meaning of "in-depth therapy." Group members formulated different ways to operationalize depth: questionnaires asking directly about the therapy's depth (Ruby, 223), change over time (Nora, 224), identifying the cause of thoughts (Michael, 225), and addressing the problem's source (Ruby, 226). While the group ended up accepting Michael's notion of depth, this acceptance was provisional, as Ruby and Nora later continued to argue that there is more than one way to define depth. We suggest that this back and forth reflects an emerging distinction between the phenomenon (which can receive multiple formulations) and its operational definition, a definition that is limited to the context of a given study and that can render the phenomenon amenable to scientific inquiry. While Michael's insistence could be interpreted as an effort to precisely define the phenomenon itself, we suggest that other group members shifted to focusing on its operational definition.
Interestingly, these suggestions were entangled with attempts to improve the study's objectivity. Ruby once again mentioned the dichotomist view of objectivity, stating that questionnaires about the therapy's depth are "subjective, not empirical" (223). Rather than accepting that questionnaires are simply subjective, Nora attempted to address this concern by using these instruments within the study design (i.e., repeated measurements over a longer period of time). Michael attempted to address the issue of a questionnaire's validity by arguing that it could be used to address his conception of "depth" (i.e., understandings of the sources of negative thoughts). Perhaps most importantly, Ruby wrapped up this exchange by highlighting the contingent nature of scientific inquiry: backing up from the idea of identifying what depth "really" is and focusing on the alignment of conceptual and operational definitions.
Stage 3: Have we achieved validity and reliability yet? After some more back and forth, the group converged on the conceptual definition of depth as "engaging with the problem's roots and understanding its source" and operationalized this via questionnaires that ask participants about the causes of their negative thoughts. The group then reflected on the validity and reliability of their proposed study. Lucas sparked this discussion with a comment on the need to mention these concepts in the assignment (reflecting his recurring aim of "meeting task requirements"), but this seemingly technical comment (turn 329) led the group to conduct a more thorough examination of this issue. Lucas' aim of "meeting task requirements" was suggested by his proposal to "add a couple of sentences" that addressed the issues of reliability and validity that were discussed in class. However, the ensuing discussion suggests that this aim converged with the group's epistemic aim of "designing a good study," as the perceived task requirements ("This showed up in one of the lessons, and it's important," turn 335) were taken up as norms for defining what counts as a "good study." In response, the group negotiated the meaning of validity and reliability in their study: members argued that their study was valid due to the alignment between the construct definition and the research instruments (turns 331, 332). The students emphasized the compatibility between their instruments and the operational definition they formulated ("a therapy that tries to understand a problem's causes"), an aspect that was mostly absent in their work on emotions. Participants also negotiated the meaning of reliability: Ruby suggested that reliability is achieved via the reliance on a uniform protocol (333), an idea that was raised in class, while other group members acknowledged the limits of this due to the reliance on self-reports (334, 336). Here again we see an understanding of procedural objectivity, also mentioned in the first task, which is achieved by using a uniform protocol. This demonstrates how Lucas' seemingly instrumental comment, which was guided by the aim of "meeting task requirements," led the group to reflect more deeply on their epistemic ideals, and thereby helped them achieve the aim of "designing a good study." Yet, group members were aware that despite all their efforts, there are still two important limitations to achieving reliability and validity. First, there are sources of error that cannot be controlled due to participants' actions (334), as well as a tension between the research instruments, questionnaires, and the phenomenon explored, psychodynamic therapy (336). Second, validity is limited because, as members acknowledged, they cannot study the "phenomenon itself," and instead focused on an operational definition that captures particular aspects that are amenable to inquiry.
To conclude, this discussion demonstrates how a task that centered on studying an ill-defined phenomenon required students to engage with operational definitions. The effort to offer an operational definition that captures key aspects of the phenomenon, while being amenable to scientific inquiry, led the group to develop their understandings of the epistemic ideals of validity and objectivity through carefully choosing appropriate designs and instruments. It also led to more refined awareness of the limitations of psychological research.

Discussion
This paper argues for the importance of nurturing students' competency to critically evaluate psychological evidence, especially everyday evidence. Building on the Grasp of Evidence framework (Duncan et al., 2018), we suggest that students' evidence evaluation practices can be bolstered by actively engaging students in collaborative evidence evaluation and design, even in large introductory psychology courses. Specifically, we set out to examine (1) psychology undergraduate students' pre-instructional epistemic ideals and processes for evaluating popular psychological evidence; (2) if, and how, these epistemic ideals and processes change following instruction; and (3) how various ideals and processes emerge as students engage in collaborative evidence evaluation and design tasks.
With regard to the first research question, analyses of students' pre-instructional grasp of evidence revealed an awareness of the importance of empirical data, and a basic understanding of scientific methods, coupled with substantial reliance on coherence with personal experience and intuition. As shown in previous work, students were well aware of the basic concept of evidence and the need to adequately support claims with data (Hogan & Maglienti, 2001; Miralda-Banda et al., 2021). However, students seemed to have a mechanistic notion of evidence as involving adequate samples and comparisons and did not consider issues of measurement validity and reliability or the social nature of evidence (e.g., replication). These findings are aligned with previous research that illustrated the challenges of appreciating the unique features of scientific research in psychology (Balloo et al., 2018; Lyddy & Hughes, 2012), and students' tendency to examine psychology along the lines of research methods characteristic of the natural sciences (Amsel et al., 2014; Estes et al., 2003).
The analysis of students' epistemic ideals and processes for evaluating psychological evidence revealed that all of the processes and many of the ideals that were used by the students were similar to the ones identified by Duncan et al. (2018) in the context of the natural sciences. This supports the claim that the GoE framework describes lay reasoning about evidence at a "mid-level of domain-specificity" that can generalize across genres of inquiry (Duncan et al., 2018, p. 913). This finding suggests that developing students' understanding of evidence at a mid-level of domain-specificity can have broad benefits. At the same time, we also identified some ideals that appeared to reflect particular challenges that are predominant in psychological research, including the challenges of appropriately generalizing from samples, clearly defining constructs, and addressing researchers' subjectivity. Although none of these issues are exclusive to psychology, we argue that they reflect some recurring challenges of teaching psychological science, such as the challenge of studying complex and socially-embedded phenomena that students (and researchers) have intimate firsthand experience with, but that are not always directly measurable.
With respect to the second research question, following the course, participants exhibited three main shifts in their evidence evaluation practices. First, there were shifts in sources of knowledge, with a significant decrease in students' reliance on personal knowledge. We suggest that this decrease could be understood as reflecting growing appreciation of the division of cognitive labor between experts and laypersons (Bromme et al., 2010; Keren, 2018). That is, students might have increasingly grasped that personal knowledge is not always a sufficient basis for evaluating psychological evidence. This decrease could also be attributed to students' increased disciplinary competence in evaluating evidence. Nonetheless, even post-instruction, about half of the students appealed to their personal experience or knowledge to evaluate evidence. Students persisted in their reliance on personal knowledge along with their growing reliance on disciplinary knowledge. This could also reflect students' early stage in their undergraduate studies and their evolving understanding of psychological science. Yet, it is also possible that students can productively continue to use both sources of knowledge and that what changes and develops over time is their ability to better coordinate personal and disciplinary knowledge. The need and capacity to coordinate disciplinary and personal knowledge was also evident in student discourse in the group tasks, where personal knowledge was auxiliary, and served mainly to complement disciplinary knowledge. As disciplinary knowledge grows, students might increasingly appreciate when to lean on it and when they can appropriately integrate disciplinary knowledge with their personal knowledge.
Second, we identified shifts in students' views of objectivity, with a growing appeal to procedural rather than value-free conceptions of objectivity. This shift was reflected in a significant decrease in explicit mentions of "objectivity" (in the value-free sense), combined with a significant increase in appeals to measurement processes: using valid and reliable research instruments and replicating results. Thus, the emphasis shifts from objectivity as a lack of subjective bias to a more procedural view of objectivity, according to which scientific evidence is justified through the processes underpinning its production (Douglas, 2007; Kienhues et al., 2020). Objectivity is no longer understood solely as an attribute of the individual scientist, but rather more attention is paid to how objectivity rests on appropriate procedures and social processes. This reflects a nascent appreciation that "objectivity of scientific inquiry is a consequence of this inquiry's being a social, and not an individual, enterprise" (Longino, 1990, p. 67). This shift was also reflected in student discourse, as students exhibited increasing awareness of the complex nature of measurement processes (i.e., their discussion of different measurements and their interplay in the emotion task).
Finally, there were shifts in definitions of psychological phenomena, with a growing appreciation of the role of operational definitions in psychological research. Post-instruction, students paid more attention to how psychological phenomena are conceptualized in ways that render them amenable to scientific study. This was reflected in a decrease in mentions of the ideal of clarity of definitions together with a sharp rise in appeals to valid methods, which reflects a stronger focus on the relation between the phenomenon and how it is studied. This shift reflects an acknowledgement that beyond the conceptual understanding of the phenomenon, researchers must ultimately find ways to offer definitions that would allow it to be studied reliably. Here again, the shift was manifested in students' groupwork, especially in the third task, where group discourse reflected an emerging appreciation of the diverse ways a psychological phenomenon can be defined. As the group sought to bridge their differing understandings of the phenomenon, they recognized that validity can be achieved by aligning conceptual and operational definitions.
Taken together, these shifts illustrate that substantial changes in students' grasp of psychological evidence are feasible over a semester-long course. Yet, it should be noted that these changes were all in the dimension of evidence evaluation, rather than layperson use of evidence. This leads us to suggest that the lay dimension was not sufficiently emphasized in the course, and that achieving such an aim could require positioning it as a more explicit goal of instruction (cf. Leung, 2020).

Emerging design principles for collaborative engagement with psychological evidence
Considering the third research question, analysis of students' group work illustrated how online collaborative work could be leveraged to engage students in disciplinary epistemic practices even in large introductory psychology courses. In what follows, we identify key design principles that could facilitate engagement with epistemic practices.

Collaborative critique and redesign of flawed studies
First, we suggest that the collaborative critique and redesign of flawed studies has unique affordances for engagement in epistemic practices of evidence evaluation. Attempts to foster evidence evaluation, as well as digital literacy skills, have demonstrated the value of engaging students in critical evaluation of authentically unreliable sources and low-quality evidence (e.g., Rinehart et al., 2016; Wineburg et al., 2022). We suggest that to foster even deeper grasp of evidence, students can be invited to come up with reliable alternatives to studying the phenomena they are reading about. When students were charged with formulating better ways of studying psychological phenomena, they paid increased attention to how reliable evidence is produced. For instance, the first task engaged students with the epistemic ideals of validity and objectivity underlying the more mechanistic aspects of study design. More specifically, the requirement to delve into the limitations of the original design, while contemplating more productive and ethical alternatives, offered opportunities to problematize design choices that are not easily visible when students design their own study from scratch. These, in turn, invited students to attend to the possible tensions between epistemic processes (e.g., developing appropriate designs, selecting adequate samples) and the epistemic ideals that underpin them. Moreover, such tasks could be used to elicit engagement with ethical questions, such as those highlighted by the Little Albert experiment, including weighing the benefits of scientific studies against their potential harms, addressing participants' identities and perspectives, and highlighting the social conditions that are underpinned by and enacted in scientific work (Nasir et al., 2021; Philip & Sengupta, 2021).
We suggest that evidence evaluation and design tasks support a two-fold process of evidence evaluation: first, students critically evaluate flawed evidence, and then, as they are asked to offer alternatives, they engage in critical evaluation of their own evidence. Invitations to critically analyze flawed evidence can expose situations in which scientific studies fail to apply appropriate epistemic processes and ideals. This creates opportunities for meaningful meta-epistemic discourse about what these processes and ideals are and especially why they matter. Students' emerging meta-epistemic understandings and commitments are then put to the test when they need to redesign the study themselves. The analysis of the epistemic aims that undergirded students' epistemic ideals and processes supported the value of engaging in redesign. The in-the-moment epistemic aim of "designing a good study" appeared to drive meaningful and critical discussions of evidentiary ideals and processes. Further, as illustrated in tasks 2 and 3, even when students focused on the non-epistemic aim of "meeting task requirements," their perceptions of the requirements of the redesign tasks also created opportunities for engagement with meta-epistemic issues, such as figuring out what "valid and reliable measures" are.
The necessity to tackle these challenges in a collaborative setting encouraged students to explicate their own thinking and to critique and develop the various definitions and measures suggested by their group members. The need to justify choices and to respond to criticisms can deepen students' meta-epistemic understanding of epistemic processes and how these might fail or succeed in upholding epistemic ideals. Beyond the pedagogical advantages of requiring students to explicate their own thinking and critique others', we contend that this process is highly reflective of the nature of scientific research. The effort to improve interlocutors' conceptualizations simulates the processes scientists engage in when aiming to critique and improve existing operational definitions and measures. Thus, engaging with such tasks can foster meta-epistemic understanding of the social epistemic processes that can increase the objectivity of scientific research (Douglas, 2009; Kienhues et al., 2020).

Engagement with diverse sources of popular evidence
A second design principle is providing opportunities for engagement with evidence as it is communicated in popular media. Engagement with diverse popular accounts of scientific evidence has several benefits. First, such accounts may be more engaging and comprehensible, which can encourage and facilitate critique. This appeal of popular accounts is not only due to their language and writing style, but also to their engagement with everyday issues, which are of interest to a broad lay audience. Hence, such sources can support students' engagement in and capacity for evidence evaluation in light of their personal connection to the issues at hand (e.g., emotions and "dark thoughts"), and the significance of such issues to their everyday lives and to the lives of their families and communities.
This leads to a second potential contribution: positioning students as "competent outsiders" (Feinstein, 2011) by providing them with opportunities to consider how evidence is communicated in the media. Such efforts not only evoke meta-epistemic awareness of the limitations of psychological research methods, as detailed above, but also draw attention to the trials and tribulations of communicating psychological evidence in popular accounts, as well as to the features of various modes of evidence presentation. This creates opportunities for developing "secondhand" evaluation skills of examining the providers and communicators of the evidence (Bromme & Goldman, 2014).
However, it should be noted that students' positioning as "competent outsiders" only partially materialized in our study, as participants did not exhibit significant increases in their awareness of epistemic ideals and processes that reflect competent lay use of evidence. To better support engagement with lay ideals and processes, evidence evaluation tasks may need to incorporate sources with more diverse quality and reliability as well as sources that present contrasting arguments. Encountering conflicting sources can encourage students to critically reflect on the credibility of these sources (Braasch & Bråten, 2017; Stadtler & Bromme, 2014). Additionally, the design of the course, as reflected in the collaborative tasks and ensuing class discussions, primarily highlighted expert evidence evaluation practices. To cultivate lay evaluation practices, this might need to be a more deliberate instructional goal. For example, task requirements could also invite engagement with methods that laypersons can use to evaluate sources and corroborate claims online (e.g., Wineburg & McGrew, 2019; Wineburg et al., 2022). Ultimately, fostering lay evidence evaluation practices and fostering appreciation of expert evidence evaluation practices are complementary goals that jointly constitute students' grasp of evidence (Duncan et al., 2018).

Confronting elusive conceptual constructs
The final design principle highlights the benefits of confronting elusive and ill-defined conceptual constructs (e.g., emotions or "depth" of therapy). First, the need to analyze and disentangle murky definitions (which often characterize popular sources) required students to engage with the challenge of translating everyday psychological phenomena into operational definitions. Such efforts invited students to reflect on the complex relations between psychological phenomena and the operational definitions needed to study them scientifically. Students had to grapple with the ideal of valid methods and the social nature of objectivity in psychological science, which rely on aligning design choices with underlying decisions concerning the aims and limitations of a given psychological study. Further, this type of task problematizes assumptions about psychological phenomena and foregrounds the intricate ties between their definition and measurement. Evidence is no longer simply a means to "uncover" the essence of a given phenomenon, but rather one possible way to render it amenable to scientific study.
Engagement with elusive constructs can also provoke discussions concerning the social processes underpinning the production of scientific evidence, and efforts toward achieving objectivity. Namely, it raises questions about the aims of scientific studies: what do we choose to study and why? Thus, for instance, the question of comparing approaches to therapy goes beyond depth or efficiency and raises broader questions, from interrogating the values underpinning the definition of successful therapy (e.g., Successful for whom? And who determines success?) to asking which types of therapy are available to different groups in the population (e.g., Who can afford psychodynamic therapy? Or any form of therapy for that matter?). This implies directing students' attention to questions concerning the "for what, for whom and with whom" underlying the production and evaluation of evidence (Philip et al., 2018).
In addition, the need to tackle such issues collaboratively and agree on definitions for the phenomenon, as well as to translate these definitions into empirical measures, revealed the complex and social nature of objectivity within the scientific study of psychological phenomena. Being charged with formulating ways to scientifically study psychological phenomena highlighted key aspects of scientific inquiry and their underlying epistemic ideals (Duncan et al., 2018; Ford, 2008). As in the case of flawed evidence, the need to discuss epistemic processes and ideals, to justify these to group members, and to reach some agreement concerning elusive phenomena can create opportunities for deepening students' meta-epistemic understanding as well as their shared commitments to epistemic ideals. Specifically, such discussions highlight the social processes underlying efforts to arrive at a valid and reliable study of elusive psychological phenomena.

Limitations and directions for future research
At the same time, we are well aware of the limitations of the current study. To begin with, because our study did not include a control condition (for ethical reasons, in order not to disadvantage students in an important course), we cannot evaluate the contribution of the various novel aspects of the course design (i.e., hybrid learning, collaborative online tasks) compared to alternative forms of psychology instruction. Hence, our results can mainly shed light on how grasp of evidence evolved following a semester-long course and clarify how students collaboratively engage with popular psychological evidence. In addition, as mentioned above, the collaborative evidence evaluation and design tasks included a single information source only, so students did not have opportunities to integrate evidence or engage with more comprehensive bodies of evidence, which would have allowed them to explore more deeply the social and procedural nature of scientific evidence. Also, the tasks' emphasis on positioning students as "experts" came at the expense of paying more attention to lay evidentiary practices. Further, the pre- and post-evidence assessment tasks did not include highly unreliable sources or conflicts between sources, two elements that are known to increase awareness of source quality. Hence, future work should facilitate engagement with more complex, contradictory, and unreliable sources. Finally, future research could delve deeper into the GoE framework's manifestation in the context of psychological evidence, both by further exploring the ideals and processes presented here and by examining additional dimensions of the GoE framework (i.e., evidence analysis, evidence interpretation, and evidence integration).
As noted, this study focused on one specific course, guided by the experimentalist and quantitative approach to psychological research laid out by Professor A. Hence, in line with the overarching rationale of this paper, the evidence presented here should be approached critically and understood within the context of the epistemic practices highlighted in the course. This study has started to shift the focus from the natural sciences to the study of psychology. Future research could develop this arc and examine more diverse approaches characteristic of the social sciences by specifically attending to the culture, history, and power that are embedded in epistemological aims, ideals, and processes (Nasir et al., 2021; Philip & Sengupta, 2021).

Conclusions
To conclude, this paper offers several novel contributions to research on integrating evidentiary practices in psychology instruction which have theoretical and practical implications. The study addresses the gap in knowledge concerning how students reason about psychological evidence and how grasp of evidence can be promoted in psychology instruction. The study illustrates that it is possible to productively engage students in challenging evidence evaluation and design tasks even in large introductory courses. The analysis of these tasks suggests emerging design principles for cultivating meaningful engagement with psychological evidence, which invite further research. The study also shows that important changes in evidence evaluation practices are possible following a semester-long course and identifies three central trajectories of change in evaluation practices (shifts in sources of knowledge, shifts in views of objectivity, and shifts in definitions of psychological phenomena). Finally, the study shows that the GoE framework can fruitfully apply to a new context: the evaluation of psychological evidence at the undergraduate level. Our study confirms key ideals and processes identified by Duncan et al. (2018) and also reveals additional ideals and processes that extend the GoE framework to a different discipline of inquiry.

Table 1. Summary of the collaborative online evidence evaluation and design tasks.

Table 2. Summary of evidence characteristics in the pre- and post-assessments.

Table 3. Summary of epistemic ideals and processes for evaluating psychological evidence that were identified in the study.

Table 4. Changes in evidentiary ideals and processes from pre- to post-assessment. Bold text denotes significance of changes from pre-test to post-test (McNemar test).
In the lectures they talked a lot about the importance of defining what we want to study and mostly that it would be reliable and valid.
I'm with Michael. We need to understand what depth is in order to answer [the question].
emotions was not picked up by the rest of the group, this time the ambiguity of the construct became a central thread in the group's work. First, Lucas commented that "we need to understand what depth is in order to answer [the question]" (215). As in task 2, Lucas can be once again understood as focused on the aim of "meeting task requirements." However, his tone was more earnest this time, highlighting what was discussed in the class. As we can see in what followed, in this case his comment led the students to delve deeper into dilemmas concerning how to define therapy "depth" and what constitutes a good study.
my opinion, an in-depth therapy is a therapy that tries to understand a problem's causes, and this is true with respect to psychodynamic therapy. We can use a questionnaire to examine to what extent the patient and therapist feel they understand what the causes of negative thoughts are, and whether they can connect different thoughts to the same source, and if changes in compulsive thoughts are aligned with changes in other thoughts related to those sources.
226 Ruby 08:57 It's deeper than a therapy that tries to change such negative thoughts or to accept them. OK, we can go with your idea, Michael, as long as we define in the beginning of our answer that in-depth therapy = therapy that focuses on a problem's sources.