Remote Proctoring in Language Testing: Implications for Fairness and Justice

ABSTRACT In the wake of the COVID-19 boom in remote administration of language tests, it appears likely that remote administration will be a permanent fixture in the language testing landscape. Accordingly, language test providers, stakeholders, and researchers must grapple with the implications of remote proctoring for valid, fair, and just uses of tests. Drawing on an argument-based approach to fairness and justice, which subsumes validity, we articulate key sub-claims, warrants, rebuttals, and relevant backing related to the use of remote proctoring in language tests. With respect to meaningfulness as a core element of fairness, we focus on how remote proctoring is both a bulwark against construct-irrelevant responses (cheating) and a potential source of construct-irrelevant variance due to inauthentic constraints on test-taking conditions. Other fairness concerns relate to technological biases across racial/ethnic groups and access to suitable technology and physical space for remote proctoring. For justice, we consider the consequences and social values of remote-proctored language tests (Coghlan et al., 2021). We propose that these articulations of remote proctoring issues within Kunnan's fairness and justice framework can usefully motivate and guide research on, as well as critique of, testing procedures and test uses.

a major concern, perhaps for good reason: Educational Testing Service, provider of the TOEFL family of language assessments as well as admissions tests such as the GRE, reported a 200% increase in score cancellations linked to test taker malpractice after the introduction of at-home testing (Nicosia, 2022). Without the standardization, physical controls, and direct monitoring of in-person proctoring (Slusky, 2020), stakeholders may reasonably doubt the trustworthiness of an individual's test results. At-home test providers have turned to a variety of remote proctoring solutions, which generally take advantage of webcams and sophisticated security software, including AI tools, to guard against malpractice (e.g., cheating) during test-taking. While the widespread institutional acceptance of test scores from online, remote-proctored exams suggests that remote proctoring has been considered sufficiently secure in practice, its effectiveness and ethics have also been questioned (Coghlan et al., 2021; Dawson, 2021; Swauger, 2020; Visser-Knijff, 2020).
After describing current approaches to remote proctoring and related concerns in language tests, we focus on integrating remote proctoring concerns into arguments for fairness (equal treatment of all test takers) and justice (distribution of benefits and promotion of positive values in society) (Kunnan, 2018). Following Kunnan (2018), we focus these arguments on fairness and justice, which encapsulate validity concerns (i.e., the sufficiency of evidence supporting intended interpretation and use of test scores) (Chapelle, 2020; Kane, 2013). In doing so, we detail specific warrants and sources of backing, along with likely rebuttals, that test developers and test score users can draw on to justify remote proctoring practices and procedures. By putting fairness and justice center stage, we are expressly calling for language testers to consider not just whether and to what extent remote proctoring supports the intended interpretation and use(s) of test scores, but also whether and to what extent specific remote proctoring approaches are fair and just to test takers.

CURRENT PRACTICES IN REMOTE PROCTORING
Language test providers currently use a range of remote proctoring configurations, from relatively straightforward human monitoring of webcam video streams to highly sophisticated configurations that additionally employ advanced technologies like artificial intelligence (AI) and biometric security. While some level of human involvement is now seen as a virtual necessity in remote proctoring for high-stakes tests (Ascione, 2021; Kelly, 2021; Slusky, 2020), there is a clear proliferation of technology-based solutions compared to in-person proctoring. The reason for this is readily apparent: In-person test center proctoring can rely primarily on administrative and physical controls to maintain security, controls that are less available, if not completely unavailable, in remote proctoring (Slusky, 2020). Where remote proctoring falls short in controlling the test-taking environment, it must compensate with enhanced monitoring.
Based on approaches to monitoring test takers during exams, four basic types of remote proctoring systems are identifiable: no monitoring, live (human) proctoring, recorded proctoring, and automated proctoring (Michel, 2020; Nigam et al., 2021). No monitoring, or an "honor code" approach (Michel, 2020), relies on test takers' integrity and/or limited security technologies, such as lockdown browsers or secure exam platforms that limit computer functions, to prevent malpractice.
The remaining remote proctoring systems all require audio/video capture via webcam and are often combined to form hybrid approaches. Live proctoring is a real-time endeavor that involves monitoring by a human proctor. This system is most analogous to in-person proctoring, though potentially less efficient, with one proctor capable of monitoring perhaps 7-8 examinees simultaneously (Draaijer, 2017). Remote live proctoring may also be less effective, as the remote proctor can only see and hear what a webcam picks up. Recorded proctoring stores webcam footage and potentially other records, such as computer activity logs, that can be used for post-hoc malpractice detection. This approach affords greater efficiency, as fewer staff are needed simultaneously, and creates a strong evidential record. However, on its own, this approach has no capacity to address malpractice in the moment and end a test administration if needed.
Finally, automated proctoring relies on technological solutions to malpractice detection, typically drawing on specialized AI in the areas of computer vision and audition. Table 1 provides a brief overview of these AI applications, and readers should refer to Burgess et al. (2022), Choi et al. (2021), Nigam et al. (2021), and Slusky (2020) for more detailed descriptions of these tools. Automated proctoring is potentially efficient, but perhaps not when considering the accuracy of monitoring: AI systems are (currently) imperfect and have difficulty separating signal from noise when identifying malpractice. For instance, recent efforts by ETS in developing keystroke biometric technology have demonstrated an overall error rate of 4.7% in same-author identification, outperforming the industry benchmark of 5.4% (Choi et al., 2021). This is impressive and could be quite useful for screening at scale, but at the same time, misidentifying around 1 of every 20 test takers would lead to numerous negative outcomes if relied on exclusively. Thus, AI's role in detecting potential malpractice must be distinguished from making decisions about whether malpractice occurred: An AI system may be reasonably efficient at the former yet not accurate enough to handle the latter without human judgment (Dawson, 2021).
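The scale implications of such an error rate can be illustrated with a back-of-the-envelope calculation; the administration volume below is a hypothetical assumption for illustration, not an ETS figure:

```python
# Why a roughly 5% error rate can be acceptable for screening but not for
# adjudicating malpractice on its own. Volume figure is an assumption.

def expected_misidentifications(error_rate, n_test_takers):
    """Expected count of test takers misidentified by the system."""
    return round(error_rate * n_test_takers)

error_rate = 0.047   # reported same-author identification error rate
volume = 100_000     # hypothetical administration volume (assumption)

print(expected_misidentifications(error_rate, volume))  # 4700
print(round(1 / error_rate))                            # 21, i.e., about 1 in 21
```

At this hypothetical volume, thousands of test takers would be misidentified if the system's output were treated as a final decision rather than a trigger for human review.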
The limitations of any one remote proctoring approach can be mitigated by using several systems in tandem. Michel (2020, p. 29) describes a hierarchy of such combined approaches.

[Table 1 near here. Recoverable fragments describe AI proctoring applications, e.g., detecting suspicious behaviors (using prohibited computer resources, attempting to take screenshots); identification of a test taker or detection of an unauthorized assistant/stand-in during or after a test; and system monitoring, in which computer activity, such as open windows and data transmissions, is analyzed by AI to detect unauthorized resources (e.g., additional windows).]

While Michel's hierarchy has not been empirically verified, its logic appears sound: Using more tools, with live proctoring as a centerpiece, creates the greatest potential for secure test administration.
While monitoring is often the focus in differentiating approaches, remote proctoring is not limited only to monitoring test takers. Check-in and check-out procedures are also important for maintaining test security, just as in in-person administration. Check-in procedures typically include ID verification and a "room check." ID verification increasingly employs biometric technology and might require capturing a still webcam image to check against pre-submitted photo ID and/or collection and analysis of other biometric data for later processing (e.g., voice sample, fingerprint, keystroke behaviors). Specific procedures for room checks vary, but test takers are typically asked to provide a 360° view of the room via webcam and confirm that no prohibited materials are accessible. After a test has been completed, check-out procedures may also be required and may include the review and/or destruction of notes taken during the exam.
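The gatekeeping logic of check-in can be sketched as a simple completeness check over required steps; the step names below are invented for illustration, and actual providers' procedures vary:

```python
# Illustrative sketch of a check-in workflow as a completeness check.
# Step names are hypothetical; real providers define their own procedures.

CHECK_IN_STEPS = [
    "id_verification",      # e.g., webcam photo vs. pre-submitted photo ID
    "biometric_capture",    # e.g., voice sample or keystroke sample
    "room_check_360",       # 360-degree webcam sweep of the room
    "materials_confirmed",  # test taker confirms no prohibited materials
]

def ready_to_start(completed):
    """Admit the test taker only when every check-in step is complete."""
    return all(step in completed for step in CHECK_IN_STEPS)

print(ready_to_start({"id_verification", "room_check_360"}))  # False
print(ready_to_start(set(CHECK_IN_STEPS)))                    # True
```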
With the numerous options for monitoring and other security measures in remote proctoring, and the apparent security benefits of deploying multiple tools/procedures in tandem, it is unsurprising that approaches taken by language test providers differ noticeably. Some test providers rely on recorded rather than live proctoring, some allow notes (on paper or whiteboard) and others do not, and some require use of headphones while others prohibit anything covering test takers' ears. Yet another option for test providers is delegation of proctoring. While it may seem intuitive for a test provider to directly oversee proctoring in-house, especially for an online test, several major test providers and many smaller testing programs rely on third-party proctoring services. To be clear, this option is not novel: In-person proctoring of large-scale exams has long been contracted to independent organizations that meet test providers' training, facilities, and audit requirements. That said, our point is this: Just as test providers adopt different methods for assessing language abilities, in terms of task types, conditions, response formats, and scoring procedures, for which justifications must be articulated, they must also justify their approaches to maintaining test security and administration conditions in at-home testing.

REMOTE PROCTORING CONCERNS
Our review of remote proctoring approaches so far has focused on maintaining test security, an obligation of all test providers (AERA, APA, & NCME, 2014; ILTA, 2020; ITC, 2005) regardless of test delivery mode. However, the tools and approaches for maintaining security in online tests, as well as the practices of several third-party remote proctoring providers, have raised concerns among advocacy groups (e.g., Kelly, 2021; Kelly & Oliver, 2020), academics (Coghlan et al., 2021; Nigam et al., 2021), and politicians (e.g., several U.S. Senators in Blumenthal et al., 2020). These concerns demand serious consideration from language testers. One major concern is privacy. As in test centers, remote proctoring collects sensitive data from test takers (e.g., national ID) as well as extensive recordings (e.g., hours of video) that are transmitted and stored digitally, which is especially concerning given past occurrences of data breaches from major proctoring service providers (Kelly & Oliver, 2020). Other concerns relate to the use of AI proctoring tools. The accuracy of such tools for detecting cheating has been questioned (Kelly, 2021), and there are longstanding concerns about racial, ethnic, and gender bias embedded in underlying algorithms (Buolamwini & Gebru, 2018; Caplan-Bricker, 2021; Grother et al., 2019; Mitchell et al., 2019; Swauger, 2020). Provision of adequate support for test takers is also a concern. Individuals with disabilities may have rights to certain accommodations (e.g., alternative input devices, presence of an assistant) or demonstrate behaviors (e.g., gaze or body movements) that may not be well supported by remote proctoring restrictions or tools. Specific to language testing, lower proficiency examinees may have difficulty following the instructions and troubleshooting directions of remote proctors who lack sufficient training or linguistic skills. Finally, a lack of transparency in remote proctoring can exacerbate the concerns mentioned above (Dawson, 2021). When done remotely, proctoring is a much less public practice, which can make it more difficult for honest test takers to observe and report misbehavior of proctors or other test takers.
In sum, language testers must weigh remote proctoring's expected security benefits against its potential harms. In many small-scale and low-stakes assessments delivered remotely, implementing few to no remote proctoring procedures may be the most reasonable course of action, as potential harms may outweigh minimal potential benefits. Nonetheless, where security concerns prevail, we propose that addressing remote proctoring through the lenses of fairness and justice, which include validity concerns, will better position language testers to address legitimate concerns and collect evidence that ultimately supports or demands revision to remote proctoring practices.

FAIRNESS AND JUSTICE: KUNNAN'S (2018) ARGUMENT-BASED APPROACH
Validity has long been the prevailing concern in language testing. Originally framed as a matter of whether a test measures what it purports to measure, the scope of validity now includes the appropriateness and consequences of test score use (AERA, APA, & NCME, 2014; Chapelle, 2020; Kane, 2013). Argument-based approaches to validity influenced by Kane (1992, 2013) have become mainstream in language testing and provide a useful structure for validation activities (e.g., Chapelle, 2020; Knoch & Chapelle, 2018). In Kunnan's (2018) framework for evaluating language assessments, the concepts of fairness and justice are foregrounded and specified in detail. Validity, which Kunnan characterizes as the "meaningfulness" (Kunnan, 2018, p. 142) of test scores for a particular decision-making context, is seen as a part of, or a precondition for, fairness and is therefore subsumed in the same argument. Justice, built on fairness, receives top billing and its own argument in this framework, rather than being dealt with in terms of test impact at the end of a chain of inferences in validity-centric frameworks like Chapelle's (2020).

Kunnan's (2018) framework consists of two overarching principles, one for fairness and one for justice, that must be followed to achieve fair decisions and just social outcomes. Each principle is further specified by several sub-principles and related sub-claims, each with warrants that must be met for a claim to hold. Whether warrants are met is determined by the availability of backing (i.e., evidence), with more and/or stronger backing expected when the stakes of a test are higher (conversely, warrants in lower-stakes tests can be satisfied with less backing; see also Kane, 2013). Backing can come from many sources, ranging from test documentation (e.g., design documents, administration procedures) to empirical evidence collected in operational or research settings (e.g., test score data, proctoring records). We note here that while documentation of test design and procedures does constitute relevant backing for warrants, empirical backing will generally be stronger. Put another way, it is one thing to have a procedure on the books, and another to demonstrate how well the procedure works. Importantly, all argument-based approaches allow for rebuttals that challenge warrants. Where a warrant, and in turn its parent sub-claims and principles, is not well supported by evidence, the larger argument may be called into question.
The concepts of fairness and justice have recently attracted long-overdue increasing attention in language testing (Deygers, 2019; Kunnan, 2018; McNamara et al., 2019; Shohamy, 2001; Xi, 2010). Fairness has most often, and for some time now, been examined in terms of test score interpretations and uses across identifiable subgroups of test takers, that is, whether test scores and interpretations mean the same thing, scores are equally generalizable, and decisions and consequences are equitable for different groups of people. If not, test score interpretations may not be well supported, and resulting decisions are at great risk of being unfair (Xi, 2010). Matters of justice have received less attention in language testing, perhaps because discussion of justice is inherently value-laden and less clear-cut than matters of fairness or validity (e.g., evaluating reliability coefficients). This limitation is addressed by Kunnan (2018), who in his framework foregrounds fairness and justice, with considerations of test score meaningfulness (i.e., validity) as building blocks thereof. Although mainstream validity frameworks (Chapelle, 2020; Kane, 2013; Messick, 1989) are explicitly concerned with test consequences and, in this sense, overlap considerably with fairness and justice (see also Xi, 2010), we agree with Kunnan that the terms fairness and justice are likely more effective for communicating issues of concern to nonspecialist stakeholders, and his well-elaborated fairness and justice framework has been useful in guiding our thinking.

Kunnan (2018) defines fairness as "a presumption of treating every test taker with equal respect" (p. 80). Evaluating fairness is necessary because valid score interpretations, according to Kunnan, do not in and of themselves guarantee equitable treatment of test takers. Putting the test taker at the heart of these considerations is clearly relevant when it comes to remote proctoring of assessments. In considering how remote proctoring could impact valid and equitable test score interpretations, we begin with Kunnan's Fairness Sub-Principle 2, followed by arguments pertaining to equitable consequences of score use related to Fairness Sub-Principle 4. Kunnan's (2018) Fairness Sub-Principle 2 states that "an assessment ought to be consistent and meaningful in terms of its test score interpretations for all test takers" (p. 80, emphasis in original). Score interpretations are meaningful when the scores assigned to individual test takers are accurate and relevant indicators of their performance on test tasks. Among several sub-claims that license Sub-Principle 2, one must first have confidence that the scored observations of test taker performances were elicited in secure, standardized conditions. Table 2 lists key warrants of this sub-claim alongside potential rebuttals and likely sources of backing which could constitute evidence for/against warrants.

Fairness sub-principle 2: meaningfulness and consistency
The first warrant (W1) holds that test administration conditions facilitate observations of relevant performances. With remote administration and proctoring, the stability of testing related to internet connections is a concern, as home internet and Wi-Fi may generally be less stable than connections in well-equipped test centers. Moreover, spotty internet connections may be flagged by proctoring software as potential malpractice, leading to test interruptions by a human proctor. Test providers should equip their proctors with tools for assisting test takers in resuming interrupted tests and make technical checks available (a) before registering for a test and (b) immediately before beginning a test. In the case of AI-supported proctoring, evidence that these monitoring systems do not inaccurately flag test takers as suspicious due to hardware malfunctions, such as a camera losing focus (Muhammad & Ockey, 2021), constitutes relevant backing.
Warrants 2-4 pertain to security and malpractice. At-home test administration raises concerns about how effectively test content can be kept secure and whether test taker malpractice, including the use of unauthorized aids (e.g., answer keys, reference tools, scripted responses) or stand-ins (e.g., someone else impersonating you during a test, someone remotely controlling your computer), can be prevented or detected.1 Implicit in the assumption that scores are derived from genuine observations of ability, test takers suspected of malpractice should be investigated and their scores canceled according to appropriate criteria. Providers of remote-proctored tests must provide evidence that they have such investigation procedures and criteria in place (Muhammad & Ockey, 2021). Post-hoc statistical analysis of test score data can provide circumstantial evidence of test content security, through analyzing differences in item difficulty across time and location, and of cheating, through similarity and/or aberrance of response patterns or unusual gain scores (Cizek & Wollack, 2017). Test providers can proactively address identity concerns by providing end users (i.e., institutions) with post-hoc ID verification checks beyond photographs, which may be of poor quality when taken by a webcam. Several major test providers do this by providing audio/video recordings of test takers (e.g., a sample speaking response or a non-test interview question) to institutions for review and comparison to an individual who has joined their institution. Content-matching software can also be used to capture scripted or plagiarized responses before scores are awarded, but establishing the effectiveness of such software requires empirical support.
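The similarity-based screening mentioned above can be sketched in simplified form; this is an illustration only, not any provider's actual method, and operational analyses rely on probabilistic indices of the kind discussed in Cizek and Wollack (2017) rather than raw agreement. All IDs, response strings, and the threshold are invented:

```python
# Hypothetical sketch: flagging unusually similar response patterns as
# circumstantial evidence of possible cheating. Flagged pairs warrant
# human investigation, not automatic score cancellation.
from itertools import combinations

def exact_agreement(a, b):
    """Proportion of items on which two response patterns agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def flag_similar_pairs(responses, threshold=0.95):
    """Return pairs of test-taker IDs whose answer strings agree at or
    above the threshold."""
    flagged = []
    for (id1, r1), (id2, r2) in combinations(responses.items(), 2):
        if exact_agreement(r1, r2) >= threshold:
            flagged.append((id1, id2))
    return flagged

responses = {
    "TT001": "ABDCCADBBA",
    "TT002": "ABDCCADBBA",  # identical to TT001: flagged for review
    "TT003": "CBDACADBCA",
}
print(flag_similar_pairs(responses))  # [('TT001', 'TT002')]
```

A meaningful threshold would in practice be derived from the distribution of agreement among independent test takers, since high agreement can also arise by chance on easy items.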
Strong backing for the effectiveness of test security measures should come from penetration testing, in which external researchers and/or cybersecurity experts endeavor to "crack" a test provider's security systems. While cybersecurity might connote the protection of online systems from hacking for many language testers, penetration testing would also entail attempts to circumvent proctoring by attempting to receive unauthorized aid during a test, submit scripted responses, use fake/manipulated IDs, etc.; research would report how successful such efforts were. Dawson (2021) discusses the challenges of such research, where permission for researchers and transparency of reporting are major issues, and comes to the conclusion that "responsible disclosure" is a reasonable compromise: Independent researchers would first notify test (or proctoring) providers of vulnerabilities, allow the provider sufficient time to resolve issues, and then report on the situation and its redress (assuming vulnerabilities have been resolved) with censoring of details that could be exploited by malevolent actors. We have so far seen little of this in language testing, but recent efforts by Duolingo (Baig & Wodzak, 2021), while perhaps too vaguely reported to constitute strong backing, are encouraging.
Beyond summarizing performance on test tasks, test scores are typically intended to be interpreted as meaningful indicators of a targeted construct (e.g., language proficiency). In the case of tests that favor a target domain description over a clearly defined construct, it is inferred that the scores reflect performance consistencies in a specified domain (e.g., English in academic settings) (Mislevy, 2018). By extension, a sub-claim that licenses Kunnan's Fairness Sub-Principle 2 is that scores reflect an individual test taker's standing (level) on the target construct (Table 3). A key warrant (W1) of this sub-claim is that the test taker's response processes align with knowledge of mechanisms underlying the target construct.
Remote proctoring, in particular how AI tools based on examinee gaze and head/bodily movements are used to flag potential malpractice when deviations from "normal" or prescribed behaviors are detected, presents a threat to the cognitive processes of test takers. Research shows that it is common for test takers to avert their gaze from the screen to take notes during listening tests (Suvorov, 2015) or when preparing responses to speaking prompts (Burton, 2023). Such behaviors may trigger AI flags, which in turn could lead a proctor to disrupt a test taker mid-exam. Moreover, there is limited evidence that visuals in listening tests positively influence listening comprehension (Batty, 2015; Cubilo & Winke, 2013), making screen gaze of questionable relevance to assessed constructs. Furthermore, there is considerable variability in test-taker gaze behavior during computer-delivered listening and speaking tests (Burton, 2023; Ockey, 2007), which problematizes AI malpractice detection based on deviations from "normal" gaze behavior. Test takers who become aware of how AI monitoring software works may consciously attempt to control their physical behaviors in ways that are not conducive to best performance: Keeping one's gaze locked on the screen might lead to fewer proctor disruptions or less risk of score invalidation, but it might also make comprehending academic lectures more difficult.
Studies of test taker behavior and response processes under remote proctoring (perhaps in comparison to in-person proctoring and/or target domain language use) can provide relevant backing to address Warrant W1. Similarly, linguistic analysis of constructed responses in remote-proctored conditions, especially when compared to test-center administrations, could provide backing for speaking and writing tasks. Additionally, analysis of AI flags and interruptions/cancellations of tests would help to quantify how often reasonable on-task behavior, indicative of typical response processes, is misidentified as potential malpractice. Webcam footage and audio recordings could be analyzed directly or used as stimulated recall material with test takers and would likely constitute powerful evidence pertaining to construct-relevant behaviors and response processes.
Fairness sub-principle 4: procedural equity

Among Kunnan's (2018) four sub-principles for fairness, Sub-Principle 4 is especially relevant to remote proctored assessments. It stipulates that "[a]n assessment ought to use appropriate access, administration, and standard-setting procedures so that decision-making is equitable for all test takers" (Kunnan, 2018, p. 80, emphasis original). We address aspects of this sub-principle most impacted by remote proctoring through four warrants. Tables 4, 5, 6, and 7 present these warrants and their key assumptions alongside potential rebuttals and sources of backing.
While the first sub-claim of Fairness Sub-Principle 4 (Table 4) is certainly one that would apply to in-person proctoring as well, it is especially pertinent in remotely administered tests, where proctoring might be done by third-party providers whose staff might have limited training or experience in dealing with non-native language speakers. While on-site administration is often handled by speakers of the same language as the language learners, remote proctors may operate internationally, which might be a particular concern when it comes to beginner learners of the language being examined (Green & Lung, 2021; Purpura et al., 2021). Providers of remote-proctored exams should therefore ensure that proctors are supported and well trained in providing accessible instructions or have competence in test takers' L1 to communicate procedures clearly. In addition, incident log inspection and surveying test takers about their remote proctoring experience should be standard practice. We note that a lack of live human involvement in remote proctoring, such as in approaches that use AI and video recording for post-hoc review, is a major limitation on support for test takers.
Similarly, to support the second assumption for this warrant, it again seems even more important for remotely administered tests that familiarization materials and instructions are provided in simplified form, in "how-to" video form (Muhammad & Ockey, 2021), or in several languages to ensure accessibility and understanding of procedures. Duolingo, for instance, offers its official guides in nine different languages at the time of writing, and ETS's website offers support in five non-English languages for the TOEFL family of exams.
The second sub-claim (see Table 5) is again applicable to face-to-face administration, but poses particular challenges for, or rather requires particular backing evidence from, providers of remotely proctored assessments. In test centers, it is comparatively easy to ensure suitability and standardization of test conditions. In the case of remotely proctored exams, the technology requirements might render the assessment inaccessible to test takers with fewer resources (Green & Lung, 2021; Purpura et al., 2021). Data comparing access to online and in-person tests can help monitor the assumption for this warrant (Muhammad & Ockey, 2021). It must be noted, however, that providers of remote-proctored language exams generally already provide good evidence that their respective technology requirements are reasonable, not excessive, and clearly communicated to interested test takers in advance (Isbell & Kremmel, 2020).
For Sub-Claim 3 (Table 6), remote proctoring again demands specific considerations. While accommodations for test takers with disabilities are now standard practice in many in-person language test situations, particularly in high-stakes tests, the implementation of such accommodations requires unique protocols when it comes to remotely proctored exams. Test takers with disabilities may have difficulty complying with remote proctoring demands (e.g., gaze aversion) and/or require assistance with some standard procedures (e.g., 360° room check). More generally, receiving appropriate assistance (e.g., from a human or from assistive technology) might be identified as malpractice by some proctoring approaches. Adhering to the principle of fairness here implies that the test provider has clear documentation and protocols in place that detail how test takers with disabilities can be assisted during the completion of a remotely proctored test, and how proctors will be trained to invigilate exams for these test takers (Phillips & Camara, 2006).

The last sub-claim for Fairness Sub-Principle 4 (Table 7) is most pertinent in remote proctoring approaches supported by AI. Computer vision technology, such as facial recognition AI, may present the most substantial threat. These AI monitoring systems have been shown to be prone to racial and gender bias (Buolamwini & Gebru, 2018; Burgess et al., 2022), which renders them highly problematic for ensuring consistent test administration conditions for all test takers. There is mounting evidence that AI systems tend to disproportionately or inaccurately flag and cause interruptions for non-white (e.g., Black, Asian) test takers with darker skin tones (Burgess et al., 2022; Feathers & Rose, 2020; Swauger, 2020), which may be due to limitations in AI training data (e.g., lack of diverse skin tone representation) or other design oversights (e.g., not accounting for sub-optimal lighting).
Comparisons of AI proctor flags and proctor interventions across different ethnic groups, for instance, or provision of "Model Cards" for AI systems ("short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups [...] and intersectional groups"; Mitchell et al., 2019, p. 220) are suitable approaches to address fairness concerns in this area. Furthermore, care needs to be taken, and evidence provided, that AI systems do not flag non-malpractice background noises (Ascione, 2021), which could disproportionately affect test takers in urban areas or those whose living situations feature more human density. In addition, in scenarios where both in-person and remote-proctored exams are provided by the same test provider, evidence should be provided that the at-home administration mode does not cause disproportionately increased anxiety among test takers due to any perceived violations of privacy. This is particularly relevant given reports of remote proctor misconduct (Chin, 2020).
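To make the Model Card idea concrete, the following is a minimal sketch of the kind of content such a document might hold for an AI proctoring component, loosely following the sections proposed by Mitchell et al. (2019); every field value here is an invented placeholder, not a real system's documentation or results:

```python
# Minimal, hypothetical Model Card content for an AI proctoring component.
# Section names loosely follow Mitchell et al. (2019); values are placeholders.

model_card = {
    "model_details": "Face-detection module used to flag absence/extra persons",
    "intended_use": "Screening for human review; not automated adjudication",
    "evaluation_data": "Held-out webcam frames labeled for person presence",
    "metrics": "False-flag rate, disaggregated by subgroup",
    "disaggregated_results": {
        # benchmarked evaluation across demographic/phenotypic groups
        "skin_tone_dark": {"false_flag_rate": None},   # to be reported
        "skin_tone_light": {"false_flag_rate": None},  # to be reported
    },
    "caveats": "Performance under low lighting not yet evaluated",
}

print(sorted(model_card["disaggregated_results"]))
```

The point of such a document is precisely the disaggregated results section: a provider publishing per-group false-flag rates supplies the kind of backing this sub-claim requires.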

Justice
The concept of justice in testing incorporates the implicit social and political values of tests and underscores that the use of a test should create beneficial effects and consequences and yield positive societal values (Kunnan, 2018). Based on Kunnan's two sub-principles of justice, remote-proctored tests would be just only if they benefit the test-taking community and promote social justice. Further, in the case of tests with multiple modes of delivery, any injustice in an at-home version is a threat to the justice of the test suite overall.
A sub-claim related to Kunnan's (2018) first sub-principle of justice, pertaining to beneficial consequences, states that online proctoring benefits the test-taking community (Table 8). The underlying warrant of this sub-claim is that a remote-proctored test with effective security will foster trust among those involved in it. When the security of the remote invigilation is respected and recognized by interested parties, the decisions made on the basis of the test can be trusted. So far, language teachers have had mixed reactions to remote proctoring, noting concerns about anxiety, mistrust, and the credibility of remotely proctored online tests (Ironsi, 2021; Valkova, 2021). Surveys of and/or interviews with diverse stakeholders could be used to investigate levels of (mis)trust in a remotely proctored test. Sub-Claim 2, licensed by Kunnan's second justice sub-principle, pertains to promoting positive values and advancing justice (Table 9). The first warrant (W1) of this sub-claim requires that the test supports and protects test takers' privacy. Invasion of privacy has been a major concern in remote proctoring (Isbell & Kremmel, 2020; Muhammad & Ockey, 2021) because many online proctoring systems are invasive, allowing proctors access to test takers' laptops, official identification documents, and home environments. Several studies (e.g., Karim et al., 2014; Purpura et al., 2021) have shown that test takers indeed find privacy concerns distressing. Many online proctoring tools, such as secure browsers or browser extensions, are programmed to lock down certain functions of a personal computer and track what the test taker does in a fashion not dissimilar to spyware (Dawson, 2021; Lupu, 2021), and they create potential security vulnerabilities for cyber criminals to exploit after a test is complete (Swaak, 2022). There is also a stark asymmetry in remote proctoring privacy: test takers are intensively monitored but unable to see what proctors are doing. In contrast, in a typical test center, a test taker can see the proctor and what the proctor is doing. Remote invigilation thus presents a challenge in detecting proctors' potential negligence or malfeasance. Procedures for monitoring human remote proctors' conduct should be in place, and independent audits would address concerns about proctor misconduct.
However, procedures to protect test takers' privacy are already in place and are gradually becoming mandated. Professional organizations such as the ITC (2005) and ILTA (2020) have made explicit calls to protect the privacy of test taker data and to follow relevant government privacy regulations. Some test providers, such as Duolingo and Pearson, follow the European Union's General Data Protection Regulation (GDPR), which has exerted considerable international influence on technology platforms and cybersecurity. To comply with the GDPR, organizations need to ensure consumers' right to withdraw consent (Article 7) and right to be forgotten (Article 17) (Li et al., 2019). Compliance with the GDPR is strongly recommended for remote testing that deals with behavioral and biometric data (Kleeman, 2020; Slusky, 2020). Currently, however, only 7 out of 29 commercial proctoring services are in compliance with the GDPR (Arnò et al., 2021). Beyond the GDPR, most high-stakes remotely proctored tests, including the TOEFL and IELTS, specify the types of information they collect and how long the information is stored. Independent audits of data collection and retention practices would constitute relevant backing for Warrant 1, as would disclosures of breaches or data leaks.
The second warrant (W2) of Sub-Claim 2 is that socioeconomic status is not a barrier to taking the test. [Table 9 excerpt: Backing for W1 — Monitoring, records of proctor behavior. W2 — The test is accessible to test takers of all socioeconomic backgrounds. Rebuttal — The remote-proctored test costs as much as or more than in-person test administration. Backing — (Affordable) price of remotely proctored tests; price comparisons between in-person and remote-proctored tests.] As discussed in the fairness section, test takers who come from different
financial, geographical, personal, and educational backgrounds should have equal access to tests (Kunnan, 2018). Similarly, depending on whether remote proctoring reduces or worsens the financial burden on test takers, the justice of a test may be enhanced or degraded. Some tests, such as the TOEFL iBT and the Pearson Test of English, do not charge more for web-based testing, and Duolingo claims that the DET's low cost is attributable to the test being remote-only. Although some argue that online proctoring can be (Dawson, 2021; Karim et al., 2014) or should be (Moghadam & Nasirzadeh, 2020; Okada et al., 2015) cost efficient and could thus enhance access and justice, there are also examples of remote proctoring incurring extra charges for test takers. Some ACTFL exams, for instance, charge an additional $4-5 to take them at home with remote proctoring (ACTFL, 2021). Depending on the system used (e.g., involvement of AI or human proctors and security software), online proctoring may or may not be less expensive than tests administered at test centers. Remote proctoring has the potential to better promote justice when it is more affordable and thus accessible to test takers of varying socioeconomic backgrounds, especially if it eliminates substantial costs associated with travel to test centers. The third justice sub-claim, related to Kunnan's (2018) second sub-principle for justice, requires that test providers have procedures in place for contesting test decisions (Table 10). This sub-claim hinges on two warrants. The first warrant (W1) is that test takers whose scores are canceled and who are denied further attempts did in fact commit malpractice. When test takers are accused of cheating, they might maintain that the accusation is false. False allegations of cheating could lead to more serious consequences than disregarding malpractice (Dawson, 2021). To avoid such controversy, test providers should collect sufficient evidence of
malpractice. Video and audio recordings of the test taker, computer screen, and testing environment can be captured during remote proctoring. If malpractice did occur, test providers should be able to produce sufficient evidence. If malpractice cannot be established, or if a test taker unintentionally committed a minor violation of test-taking rules, some leniency and a reattempt seem justified. However, test providers are required to take suitable measures when cheating happens. The value of justice will be well respected not only when test providers accurately identify cheating but also when they impose appropriate sanctions on cheaters (Watson & Sottile, 2010). It remains a matter of much debate, however, what an appropriate consequence for cheating actually is. [Table 10 excerpt: Rebuttals — Test takers are not told why their exam is canceled; test takers have no recourse when their exam is canceled by a remote proctor. Backing — Test takers are clearly informed why their test was canceled; test takers are given an opportunity to file a complaint.]
The second warrant (W2) is that test takers and users have recourse to the test provider. When taking action against those suspected of malpractice, test providers should present test takers with solid evidence or reasons for score cancellation. Not knowing specifically what one has been accused of puts test takers at a disadvantage when making an appeal. Proactively providing test takers with reasonable criteria for score cancellation and comprehensible instructions about what is considered cheating would support both the fairness and justice of the test (Muhammad & Ockey, 2021). The ILTA Guidelines for Practice (ILTA, 2020), for instance, specify test takers' right to voice their concerns about the testing process or the test results. For a test to be just, all test takers and test users should have easy access to clearly defined procedures for lodging complaints or appealing malpractice-related decisions. Any test taker with low target language proficiency or a disability, for example, should be able to understand the appeal procedure and make complaints to the test provider.

DISCUSSION
At-home test administration and remote proctoring are now part of the language testing landscape. Remote proctoring has yet to be extensively treated in the language assessment literature, as most language test proctoring was conducted in person until at-home administration of language tests was accelerated by the COVID-19 pandemic. While it would be inaccurate to claim that test providers have not been concerned with issues of test security and at-home proctoring, security and remote proctoring have mostly been treated as separate from fairness, justice, and validity, evaluated under more technical, quality-assurance criteria rather than formally integrated with how test scores are interpreted and used.
Along these lines, we have found Kunnan's (2018) approach to fairness and justice helpful for addressing issues related to remote proctoring, namely for its elaborated set of sub-principles, which guided our thinking. We drew on our hands-on experiences with remote proctoring and our reading of the relevant literature and were readily able to make connections to Kunnan's principles; from there, we articulated more specific arguments by identifying how remote proctoring issues could pose threats to the (sub-)principles of fairness and justice. While Kunnan's framework proved generative for us, we also acknowledge some limitations of our work. First, the arguments we presented are almost certainly not exhaustive, and it remains to be seen how critical each is in practice. We hope that future examinations of empirical evidence related to remote proctoring by test providers and researchers will lead to further refinements of fairness and justice arguments. Second, while we believe the concepts of fairness and justice resonate with test takers and many test users, some test providers and researchers may find our focus on fairness and justice, rather than validity, incongruent with their own priorities or current/ongoing test evaluation and improvement efforts (e.g., some test providers have presented elaborated validity arguments used to organize validation activities). We do not advocate for purism and encourage others to adopt/adapt our arguments in other organizing frameworks as needed; most important, in our view, is a critical examination of the effects of remote proctoring on test taking. As noted previously, parallels can be drawn between Kunnan's framework and popular argument-based approaches to validity. Specifically, our arguments related to Fairness Sub-Principle 2, Sub-Claim 1 and Fairness Sub-Principle 4 could be incorporated into the evaluation inference of Kane (2013) and Chapelle (2020), while Fairness Sub-Principle 2, Sub-Claim 2 most neatly aligns with the explanation inference. For justice, our arguments largely align with the inferences Chapelle (2020) labels utilization and consequence implication.
With regard to the use of AI in remote proctoring going forward, we see reasons for both optimism and continued skepticism. There is little doubt that AI technology will continue to make (rapid) improvements, leading to greater accuracy that will reduce threats to fairness and justice. Parallels might be drawn to automated constructed-response scoring tools, which are now commonly used, and less widely critiqued, for scoring written, and increasingly spoken, performances in language assessments. Still, there currently is, and may always be, a need for human judgment and measures to correct for misidentification of malpractice, and in some cases, the use of AI tools may not be justifiable. Moreover, skepticism of, and critically oriented, independent research on, the effectiveness of AI appears crucial for its development. For example, Buolamwini and Gebru (2018) highlighted racial and gender biases in widely used facial recognition tools; in turn, the developers of those tools wrote public responses and made measurable improvements that reduced (though did not eliminate) group disparities (Raji & Buolamwini, 2019). Research on remote proctoring technologies used in language tests, including computer vision and input monitoring, could similarly lead to technical improvements and reductions in potential harm to test takers.

CONCLUSION
We hope that our framing of remote proctoring in the language of fairness and justice arguments, which subsume validity, will motivate the field to view proctoring as more central to the testing enterprise and more worthy of research. Just as Knoch and Chapelle (2018) noted in their review of validity implications for rating procedures, we cannot claim that our presentation of arguments related to remote proctoring is comprehensive. Nor have we provided a simple template to fill in. Rather, test providers, test users, and (ideally, independent) researchers may find our examples useful for framing their own context-specific practices and concerns in terms of fairness and justice. Similarly, with our emphasis on concerns with remote proctoring as the departure point (translating to rebuttals in arguments), we do not and cannot provide clear guidance on the "right" way to remotely proctor language tests. Aside from requiring some level of human involvement in the process, there is a lack of consensus on what constitutes best practice. That will be important work for the future, and treating proctoring as an important part of fairness and justice should motivate more, and more transparent, research on the topic.

Disclosure statement
D. Isbell has received research funding from British Council, Duolingo, and Educational Testing Service, honoraria from Educational Testing Service, and has consulted for IELTS UK and Duolingo.
B. Kremmel has received research funding from British Council, Cambridge English, and Educational Testing Service, honoraria from Educational Testing Service, and has consulted for IELTS UK.
J. Kim has received research funding from British Council and has interned for ACTFL.

Table 1 .
Overview of common AI applications in remote proctoring.

Table 6 .
Fairness Sub-Principle 4, Sub-Claim 3. Sub-Claim 3: The test is accessible to test-takers with disabilities, with appropriate accommodations.

Table 10 .
Justice Sub-Principle 2, Sub-Claim 3. Sub-Claim 3: The test provider has transparent procedures in place for challenging test decisions.