Understanding and developing procedures for video-based assessment in medical education

Introduction: Novel uses of video aim to enhance assessment in health-professionals education. Whilst these uses presume equivalence between video and live scoring, some research suggests that poorly understood variations could challenge validity. We aimed to understand examiners’ and students’ interaction with video whilst developing procedures to promote its optimal use. Methods: Using design-based research we developed theory and procedures for video use in assessment, iteratively adapting conditions across simulated OSCE stations. We explored examiners’ and students’ perceptions using think-aloud, interviews and focus groups. Data were analysed using constructivist grounded-theory methods. Results: Video-based assessment produced detachment and reduced volitional control for examiners. Examiners’ ability to make valid video-based judgements was mediated by the interaction of station content and specifically selected filming parameters. Examiners displayed several judgemental tendencies which helped them manage videos’ limitations but could also bias judgements in some circumstances. Students rarely found carefully placed cameras intrusive and considered filming acceptable if adequately justified. Discussion: Successful use of video-based assessment relies on balancing the need to ensure station-specific information adequacy; avoiding disruptive intrusion; and the degree of justification provided by video’s educational purpose. Video has the potential to enhance assessment validity and students’ learning when an appropriate balance is achieved.

different communication-related features (i.e. the face more than gestures) than someone watching the same scene live (Gullberg and Holmqvist, 2006). This may result from a reduced sense of volitional control or reduced expectation of interaction (Foulsham, Walker and Kingstone, 2011).
These effects may not simply emanate from the modality (i.e. video vs live) so much as what information is included in the video. For example, being able to see a person's head and shoulders rather than just their face increases the empathy of watchers (Nguyen and Canny, 2009). Collectively these findings suggest that the manner of presentation of video information could importantly influence the processes of assessment judgements.
Extrapolating from these studies, we might posit that a range of factors could influence the way that assessors make video-based judgements compared to live judgements: a narrower focus of vision, reduced interpersonal interaction, or reduced volitional control might all influence assessors' attention, recall or empathy, or alter the salience of particular features of performances. As assessment judgements are, at least in part, intuitive, and sensitive to both context (Yeates et al., 2012) and attentional salience (Gingerich, Schokking and Yeates, 2018), these and possibly other processes have the potential to explain why video scores (despite general similarity) have differed from live judgements on some occasions and to explain whether there are conditions which make it more or less likely for differences to arise.
Whilst variations in examiners' judgements between video and live modalities could threaten the validity of resulting video-based scores, any undue influence of filming on students' performances would also constitute a source of construct-irrelevant variance (Amin et al., 2011). In sports science, the combination of an audience and video recording has been shown to reduce performance for self-conscious individuals, whilst improving it for others (Wang et al., 2004). Whilst test anxiety in OSCEs due to an awareness of examiners is well described (Harrison et al., 2014), it is unclear whether videoing students' assessment performances might add to this sense, thereby unduly altering performance for some students.
As a result, whilst video use has the potential to enhance assessment in several ways, there may be a variety of largely unexplored influences on, or implications for, video-based performance judgements in health professions education. As these poorly characterised processes could have important unintended consequences for the validity, fairness or acceptability of video-based assessment judgements, we sought to understand the process of video-based judgements whilst seeking to establish whether there are conditions which will help to ensure their optimal use, by addressing the following related research questions:

1. How do examiners' judgemental processes compare when judging video-based performances and live performances?

2. What filming procedures are needed for different types of assessment tasks to maximise the likelihood of the processes of video-based judgements being equivalent to live judgements?

3. How do students and examiners experience and interact with video in assessment, and what conditions are needed to minimise any resulting threats to assessment validity?

Methods:
We used design-based research (Baumgartner et al., 2003) to explore and develop a theory of video-based assessment whilst iteratively developing our filming approach. Design-based research enables development of a learning environment through continuous cycles of design, enactment, analysis, and redesign, whilst simultaneously developing educational theory. Data are typically collected through a mixture of methods, which may include surveys, measurements, observations, field notes, brief conversations with participants, think-aloud, interviews or focus groups (Cobb et al., 2003). In order to manipulate both assessment scenarios and filming conditions without prejudicing actual examinations, we used simulated Objective Structured Clinical Examination (OSCE) stations (Newble, 2004) which were both videoed and examined live. We principally collected data through participant interviews with examiners and students and documentation of researchers' observations. We additionally performed a number of examiner and student focus groups to determine whether additional perceptions were co-constructed within resulting dialogues.
Population, sampling and recruitment: Our study populations were undergraduate OSCE examiners and clinical years medical students from Keele School of Medicine. We purposively sampled participants from a variety of ethnic backgrounds, with English as either a first or second language, and from different regions of the UK.
We sampled novice and experienced examiners from a variety of specialities.  2). Examiners were provided with detailed station information (marking criteria, simulated patient scripts, student instructions).
After reading station instructions for the OSCE scenario, students were asked to enter the OSCE station and perform as they would in a real OSCE. In most iterations, two examiners were present: one in the room (the "live" examiner), and one outside (the "video" examiner). The video examiner was provided with the same station information as the live examiner, and asked to judge the video performance in the same manner as they would a live performance. The video examiner watched, via video, the same performance the live examiner had judged, either immediately (within 20 mins) after the live performance whenever possible or following a delay (up to a few days) if the examiner's availability made this pragmatically necessary. Whenever participants' availability allowed, examiners crossed over between the live and video roles and judged a second student performing at the same station.
Both live and video examiners scored the performance and considered the feedback they would give.
Filming approaches: Researchers filmed the simulated OSCE stations using a variety of filming methods. These included fixed, wide-angle ceiling cameras (identical to filming in Yeates et al. (2019)); tripod-based camcorders, positioned in various places within the room, and using varied degrees of zoom; head cameras worn by the examiner; and wall-mounted pan / tilt / zoom CCTV cameras positioned in various places within the room (see figure 1). Sound was collected variously using ceiling-hanging microphones; focused microphones placed on the camcorders; and table-top condenser microphones. Camera positions and settings were documented and iteratively developed (see figure 1 for an example).

Examiner interviews
Examiners were asked to persist in the frame of mind evoked by examining whilst they completed score sheets. Although both "live" and "video" examiners were asked to note the feedback they would give, the "live" examiner verbally communicated their feedback to the student so that the student's learning was supported by study participation. Both "live" and "video" examiners were asked to perform retrospective think-aloud (Van Den Haak, De Jong and Schellens, 2003) describing all aspects of performance which were salient to their judgements. Next, researchers used semi-structured interviews (Galletta, 2013) to explore examiners' perceptions of: judgemental influences, simulation authenticity, encountered difficulties, information management strategies and their judgemental certainty. Topic guides were derived from our initial literature review and evolved to test emergent theory. Researchers probed "video" examiners' perceptions of the availability of salient visual and audio information and "live" examiners' perceptions of differences between modalities. Further questions explored all examiners' comfort making video-based judgements, perceptions of the acceptability and intrusiveness of filming and potential implications of uses of the videos.

Student interviews and focus groups
Semi-structured interviews with students explored their awareness of cameras, or any perceived influence of cameras on their performances, along with their perceptions of the acceptability, challenges or potential educational benefits of video within assessment. Focus groups (Gill et al., 2008) explored issues at the intersections of students' and examiners' perspectives by discussing a similar range of issues. Focus groups were conducted on the same day as a filming iteration, and involved examiners and students who had been present for that iteration.
Data Analysis: Analysis in design-based research can draw from an array of methods (Anderson and Shattuck, 2012), but analysis methods derived from grounded theory (Guba and Lincoln, 1982; Charmaz, 2006) have been recommended for interview and focus group data to ensure rich theory development (Bakker and van Eerde, 2015). Using these analysis methods, interview and focus group data were analysed iteratively, interspersed with new data collection. Two researchers (PY, a clinician educator, and AM, a post-doctoral health psychologist) independently performed both inductive and theoretical open coding (Bryant and Charmaz, 2019) of early iterations' data. Through frequent discussion, analysts agreed a coding frame which evolved as analysis progressed. AM coded all data whilst PY additionally coded all think-aloud data and six interviews. Discrepancies were resolved through discussion.

Researchers used constant comparison involving challenge and search for discrepancy in both new and existing data (Lincoln YS, 1985), micro-analysis (Engward, 2013), and memo-writing (Montgomery, Bailey and Bailey, 2007). Consistent with our design-based research approach (Cobb et al., 2003), data from interviews, think-aloud and focus groups were integrated with researchers' field notes of practical adaptations to filming conditions and observations of effects of particular modifications as we developed axial (Strauss and Corbin, 1998) and then selective codes (Holton, 2010) which were used to organise the final theory. Data sufficiency (Varpio et al., 2017) was deemed to have occurred when the developed theory adequately described all observations within the 9th and 10th iterations.
In line with the approach adopted in prior design-based research (Koivisto et al., 2018; Papavlasopoulou, Giannakos and Jaccheri, 2019), some of the reported results are illustrated by verbatim quotes from participants whilst other findings are drawn from researchers' observations across multiple iterations and are therefore not illustrated with quotes.

Results:
Sixteen students and fourteen examiners participated across 10 iterations of data collection.

Participants were diverse in nationality, UK region and ethnicity, and included people for whom English was not their first language (see table 1). Iteration details are presented in table 2. Interview, think-aloud and focus group data comprised approximately 28 hours corresponding to 313 pages of data, which sat alongside notes summarising observations and modifications at each stage.

Making Judgements from Videos
Examiners described a sense of detachment when judging performances by video which made them less immersive than live judgements.

… it's very hard to pinpoint what I may have lost from watching it on the video, I'm not sure if there was anything specific that was lost. But it just felt very different …

Examiner 1, interview
Examiners described reduced volitional control over what they saw which contributed to this sense of distance. Despite this, they were usually able to comfortably make judgements on performances from videos.

I feel overall, again I feel confident of my judgment compared to if I was in the room… I actually feel quite happy I've got adequate amount of information to make a judgement.

Examiner 11, interview
Video was perceived to be capable of enhancing assessment, but this capability depended on achieving a sufficient compromise between three inter-related themes: ensuring information adequacy in videos; interaction with examiners' judgemental tendencies; and balancing acceptability and purpose.

Ensuring Information adequacy in videos
Broadly speaking, examiners commented on similar aspects of performances whether examining live or video performances. Whilst in some instances pairs of examiners varied in their focus or interpretation, these variations appeared to emanate from individual differences between examiners rather than the modality. There did not appear to be any overall systematic influence of the modality (video vs live) on examiners' focus, judgemental processes, or interpretation of behaviour.
Despite this general overall impression, there were clearly instances where video-based examiners weren't entirely satisfied with the information which videos presented: sometimes this related to information excluded from the shot ("so we don't have his head this time … so absolutely no facial expressions", examiner 5, viewing tripod camera); sometimes about the clarity of detail within the shot ("So I can't actually see what she's prescribing at all, so I don't feel able to actually give that a mark." Examiner 11, viewing video); sometimes something in the shot was obscured (i.e. the student had faced away from or blocked the camera). In some instances the sound was indistinct ("I could see him talking to the student but I couldn't actually hear exactly what he was asking", Examiner 4 watching headcam video) or the lighting was poor. Rarely, examiners described an overall sense that communication simply had not transmitted as well as in the live scenario. These challenges were more common in early iterations as researchers developed the filming approach.
Examiner 11, wide-angle and zoomed tripod camera views, Iteration 6
We found that the optimal combination of these factors varied for different types of OSCE task.
Examiners generally preferred viewing consultations from a seated eye-level (0.8-1.0m), whilst procedural skills were seen clearly using a zoomed-in camera 1.8m from the ground. Examiners were satisfied with a single perpendicular view of a consultation (student facing the SP). Conversely procedural or examination skills required two views: one oblique wide angle view to give a sense of interaction with the SP, and a reverse angle view to give close-up procedural detail (see figure 1).
Particular requirements emerged for specific tasks: the need to see the simulated patient's back during a respiratory examination; the need to see a close up of the student's writing in a prescribing task. Consequently we found that it was necessary to be able to move the position of cameras within the room for different station set ups.
The type of video cameras influenced information adequacy. Tripod-mounted camcorders were flexible, but it was difficult to get sufficient height to see procedures without being obtrusive.
Conversely, ceiling camera views seemed "flattened" and made it difficult to see facial expressions. Working from these observations, we found the optimal balance was achieved by using two wall-mounted CCTV cameras. These gave excellent image quality, were unobtrusive to students and enabled rapid video processing. To enable flexible positioning we developed movable frames which let the cameras be set at various heights and positions within a room. We chose camera positions, angulation and zoom for each station based on analysis of its layout and tasks (see figure 1).
Consequently we found that videos were capable of providing examiners with sufficient information to make dependable judgements, but required both technical audio-visual expertise and analysis of station content by someone with clinical/educational expertise to choose station-specific camera positions and settings.

Interaction with Examiners' Judgemental Tendencies
Despite not having the immersive, three-dimensional immediacy of live examining and there being occasional details which examiners could not see or hear, examiners were, for the most part, comfortable to judge performances via video. Examiners described (or displayed) a range of judgemental tendencies which enabled them to manage the limitations of videos. Examiners described a tendency (in both live and video scenarios) to make global judgements of candidates, or to be guided by the candidate's fluency.
there was something about his overall approach, it was… he could have been slightly slicker and more fluent but he was, he clearly knew what he was doing.

Examiner 5, interview (video)
This enabled examiners to make judgements even if some specific details were missing. Examiner 10 commented that not all information in the performance was salient; that small aspects of performance were "not a deal breaker" (Examiner 10, Iteration 6). Examiners sometimes made inferences about specific aspects of a candidate's performance which they had been unable to see.
Occasionally examiners would do this for aspects of performance which were extremely important:

So I felt I've made a big assumption that she has primed the line correctly which I think for this particular skill is quite a big assumption to make. Because if you had flushed a line full of air into someone, that's a "never event". But I think given her overall demeanour and the confidence, I could tell there was fluid in the chamber, I feel confident I've probably made the correct decision there.

Examiner 11, interview (video)
Notably, despite offering a judgement at the time, this examiner expressed further doubts about the clarity of their observation later in the interview. A few examiners described having an instinct about when it was ok to make inferences. Examiners sometimes referred to proxy information when trying to interpret situations where there was something which they couldn't see or hear:

It looks as though he's going to be doing the ankles but you can't actually see the ankles but you sort of move down that end

Examiner 5, Iteration 3 (video)
Only occasionally did an examiner state that the video did not offer sufficient information to enable them to make a reasonable judgement. Notably, examiners' judgemental tendencies were not limited to video-based judgements. Examiners described making inferences during real OSCE examining, sometimes in response to fatigue or brief lapses in concentration, or their judgements being influenced by a global sense of performance.
Consequently, a number of well-described judgemental tendencies (global judgements, inferences, differential salience) along with some previously undescribed processes (using proxy information) appeared to enable examiners to manage most challenges which emanated from videos. Whilst often reflective of authentic live examining or an examiner's expertise, in some instances examiners might have tended towards making video-based judgements which were not adequately informed to ensure safe assessment decisions. Consequently, examiners' judgemental tendencies appeared to interact with the adequacy of information in the videos to either enhance or detract from the overall quality of video-based judgements. This was particularly the case for stations where it was harder to ensure information quality, for example in procedural skills stations where fine detail or particular movements had the potential to importantly influence examiners' judgements.

Balancing Acceptability and Purpose
In the majority of iterations, students perceived that cameras were only minimally intrusive. Several commented that they rapidly forgot about cameras as they engaged in the task.
I think with the camera positioning, both of us agreed that when we were talking to the patient, the camera wasn't in our face. It wasn't even in our vision.

Student 1, focus group
In one instance, Student 5, who was simultaneously being filmed by four cameras (two tripod cameras, a ceiling camera and an examiner head-camera), commented that they were only passingly aware of one tripod camera. Another student (student 10) commented that despite passing awareness of the cameras, they didn't feel the cameras increased their sense of scrutiny beyond that caused by the examiner.
Despite these general assurances, there were instances where both students and examiners perceived that the cameras had disrupted a student's performance:
when you asked me the questions and I was answering and I looked at the camera, I forgot the question.

Student 1, focus group
I didn't mind the camera at all but I didn't like what the camera did to the student 'cos she was fine until she turned around to answer my questions, saw the camera and panicked.

Examiner 2, interview
Whilst this only occurred in a small minority of cases it indicated the potential for students to freeze or lose their train of thought in response to seeing cameras. Some students perceived that other students within their year (outside of our sample) who were more prone to assessment anxiety would be at greater risk of disruption.
We found that negative influences on students' performance could be prevented by careful positioning of equipment. For example, we found that positioning cameras (1) where students would look whilst performing the task, or (2) where they would look whilst talking to the examiner could potentially be disruptive. By contrast, positioning a camera in the arc between points 1 and 2 did not appear to be disruptive, despite students moving their gaze across this arc whilst turning to face the examiner. As a result a tension existed between, on one hand, optimising camera and microphone placement to maximise information adequacy for examiners and, on the other hand, minimising the potential for the presence of cameras or microphones to unduly influence students' performance.
For both students and examiners, the acceptability of any potential intrusion by cameras was balanced against the potential benefits of videoing ("It depends on the goal" Student 2). Most students were clear that they cared greatly about standardisation in OSCEs and believed that there was room within current practice for it to be enhanced. As long as cameras were not unduly obtrusive, students perceived that the potential for video to enhance standardisation offset any sense of intrusion which cameras caused. The intended use and distribution of videos was important to students. Student 2 commented that restricting access to their videos to just a few members of staff (rather than wider availability) was an acceptable degree of exposure given the potential for the videos to enhance standardisation.
Examiners and students described numerous ways videos might enhance OSCE standardisation: to facilitate examiner score comparisons, benchmarking or training, or mediation of appeals.
Participants described potential enhancement of students' learning through video-based feedback:
to actually see your performance and think 'Oh okay yes, I can see I really didn't do well there' … that would be very helpful … you'd get so much more out of the OSCE experience rather than just 'It's an exam'.

Student 8, interview
Some participants suggested assessments would be more authentic with just the student and simulated patient in the room.
Students and examiners perceived that cameras might help to prevent examiners deviating from assessment instructions.
[referring to being videoed] You have to actually listen carefully and use the mark scheme that everybody else is going to be using, otherwise you will stand out like a sore thumb. So I think it's good for examiners.

Examiner 15, interview
Whilst potentially beneficial to standardisation, examiners and students perceived that videoing could detrimentally influence examiners' interactions with candidates. Both examiners and students suggested examiners may legitimately encourage students, especially when flustered or nervous, but that perceived scrutiny of their behaviour by video might make examiners stricter or colder.
Student 1 commented that they look for indications of examiners' approval, which they felt they would be lacking if video were being used.
Consequently, both students and examiners perceived that video could enhance assessments without unduly influencing students' performance or being unacceptable to participants.

Summary of results
Examiners experienced video-based performances differently to live performances. Whilst judgemental processes were similar between video and live modalities, specific combinations of station content and filming conditions limited the adequacy of information availability or interacted with examiners' judgemental processes to produce judgements which may not have been fully representative of live judgements. Students rarely perceived cameras as intrusive and performance disruption could be avoided through thoughtful camera placement. Video was perceived to enhance assessment when a sufficient balance was achieved between: supplying examiners with enough information; minimising intrusion; and ensuring the purpose of videoing provided adequate justification (see figure 2 for illustration).

Relationship to existing literature and theory
We observed a number of well-described features of examiners' judgements including global judgements, first impressions (Wood, 2014) and the use of inferences (Govaerts et al., 2011; Kogan et al., 2011). Whilst these processes are collectively presumed to emanate from automatic (or system 1) judgements or schema-based processing (Bargh and Chartrand, 1999), the degree to which examiners' judgements rely on automatic as opposed to conscious deliberate processing remains debated (Gauthier, St-Onge and Tavares, 2016). The observation that examiners in our study were often comfortable making judgements despite limitations in visual information, and, moreover, that examiners described missing details within their observations of real OSCEs due to fatigue or lapses in concentration, tends to suggest that the role of automatic (system 1) reasoning in examiners' judgements could be substantial. Whilst automatic judgements may provide an efficient means of managing the mental workload of assessments (Byrne, Tweed and Halligan, 2014; Tavares and Eva, 2014), they also enable the halo effect (the tendency for an impression created in one area to influence opinion in another) (Gingerich, Regehr and Eva, 2011). Our study suggested that this could be a particular concern in video-based assessment when critical details are not captured (e.g. fine detail of equipment handling, detail of written information) if they contradict the more general impression created by the performance.
Our concern that students' performance might be impeded by the presence of video cameras was only occasionally realised. Students attributed their limited awareness of video cameras to focusing their attention on the assessment task. This may be an example of "inattentional blindness" (Simons and Chabris, 1999) in which people fail to perceive a clearly visible object (for example a gorilla walking amongst people playing basketball (Simons, 2010)) due to actively focusing on something they consider important. As a result, video cameras may be less intrusive than anticipated, as long as students are actively focused on a task. Nonetheless, the fact that cameras occasionally caused students to freeze underscores the importance of mitigating this risk by careful camera placement.

Implications for practice
Based on our findings, we recommend that whilst video has the potential to enhance assessment in several ways, careful set-up of video equipment (position, zoom, focus) must involve someone who understands the clinical and educational content of each station. Consideration should be given on a station-by-station basis to the likely actions of students, their predictable gaze patterns, the movement of examiners, and the features of performances for which close-up detail may be required. Video capture should ensure that examiners have adequate views of critical performance elements to avoid the use of potentially detrimental inferences. This risk appears to be content-dependent and may be greater in procedural skills stations.
Many of the compromises we have described (information vs intrusion; acceptability vs purpose) will vary for different assessment purposes. Students may be less concerned about intrusiveness in an assessment which has few consequences. Situations where examiners need to view videos immediately after candidates' performances may preclude video-processing to provide zoomed and wide-angle views. Students may agree to their videos being used to help standardise the exam in which they have participated (Yeates et al., 2019), but not agree to their broader use in faculty development. As a result different choices are likely to be appropriate in different contexts.
Whilst video may seem to offer an attractive means of settling appeals, reviewing them could impose substantial institutional time demands or produce vulnerability to legal challenges.
Consequently institutions should think carefully about the duration of video storage and how they communicate with students about the purpose, use and access of videos.
As a result, whilst we anticipate that the filming solution we described in the results (based on dual CCTV cameras which can be moved to different positions and heights within the room) will be suitable for many assessment situations, use of this equipment in each specific assessment context should be tailored to balance the competing tensions we have described.

Reflexivity
Four researchers (PY, AM, RF, RMK) are involved in researching video-based methods to enhance OSCE standardisation (Yeates et al., 2019). These researchers acknowledge a motivation to attempt to ensure equivalence between live and video-based scores. Including 2 other researchers (JL, LC) who are heavily involved in teaching and assessing clinical skills, and 1 undergraduate medical student, brought balance to the research team.

Limitations
Whilst our study had significant strengths in terms of diverse sampling, careful data collection and rigorous analysis, it nonetheless has some limitations. The design of our study prevented us from determining the influence of modality (video versus live) on the scores of individual stations as these comparisons were confounded by inter-examiner variability. Whilst these questions have already been addressed by large quantitative comparisons (Chen et al., 2019), our purpose was to understand how the modality might influence judgemental processes. By exploring participants' experiences and perceptions and comparing across several iterations our method was able to offer insight into this phenomenon.
Participants (both students and examiners) were self-selected volunteers, the scenarios were simulated (and therefore low-stakes) and no students reported significant assessment anxiety. We can't exclude the possibility that videoing would be more intrusive in higher stakes settings. Further research is needed to explore this potential. We sampled across a diverse range of students and examiners. Whilst we didn't find any suggestion of differential effects of videoing on any group of students or examiners, we can't exclude the possibility that video-based assessment could operate differently for groups of students or examiners outside of our sample.

Suggestions for future research
Further research should seek to replicate these findings in other contexts, with other groups of participants. Survey research could determine whether our participants' perceptions are shared more widely amongst students and examiners. Given that some prior research has suggested that video-based performances may be remembered differently to live performances (Ihlebaek et al., 2003; Landström, Granhag and Hartwig, 2005), future research could determine whether video-based performances obtain greater prominence in assessors' recollections. This could be important in longitudinal forms of assessment. Research should determine how emerging uses of video in assessment (remote examining, benchmarking, video-based feedback, video-based score comparisons) enhance assessments or contribute evidence towards their validity.

Conclusions
Whilst video offers the potential to enhance assessment through several novel means, its implementation requires care. Educators should thoughtfully balance the intended purpose of videoing; the content-specific need to ensure information adequacy for examiners; and the potential for filming to disrupt assessment performance, to produce a compromise which enhances assessment validity and supports students' learning.