Physiological measures of operators’ mental state in supervisory process control tasks: a scoping review

Abstract Physiological measures are often used to assess the mental state of human operators in supervisory process control tasks. However, the diversity of research approaches creates a heterogeneous landscape of empirical evidence. To map existing evidence and provide guidance to researchers and practitioners, this paper systematically reviews 109 empirical studies that report relationships between peripheral nervous system measures and mental state dimensions (e.g. mental workload, mental fatigue, stress, and vigilance) of interest. Ocular and electrocardiac measures were the most prominent measures across application fields. Most studies sought to validate such measures for reliable assessments of cognitive task demands and time on task, with measures of pupil size receiving the most empirical support. In comparison, less research examined the utility of physiological measures in predicting human task performance. This approach is discussed as an opportunity to focus on operators’ individual response to cognitive task demands and to advance the state of research. PRACTITIONER SUMMARY Physiological measures can provide the basis for dynamic operator assistance in supervisory process control tasks. This review synthesises the existing evidence, highlighting both the aggregated empirical support and the heterogeneity of the results. To advance the status quo, a larger emphasis on physiological measures as predictors of operator performance is needed. 
Abbreviations: HF/E: Human factors and ergonomics; CNS: Central nervous system; PNS: Peripheral nervous system; HR: Heart rate; HRV: Heart rate variability; IBI: Interbeat interval; AVNN: Average of RR intervals; SDRR: Standard deviation of RR intervals; CVRR: Coefficient of variation in RR intervals; RMSSD: Root mean square of successive RR interval differences; pNNX: Percentage of successive RR intervals that differ by more than X ms; MAD: Median absolute deviation; LF: Power of the low-frequency band; MF: Power of the mid-frequency band; HF: Power of the high-frequency band; TP: Total power.


Introduction
The transition from manual control to (partial) automation, especially in aviation and the remote control of lunar vehicles, sparked interest in assessing the ability of humans to perform supervisory control (Ferrell and Sheridan 1967; Sheridan 1960). In supervisory control, the system processor performs continuous control by relying on the system's sensors and actuators, while providing the human operator with intermittent feedback and accepting the operator's corrective commands (Sheridan 1997, 2021). Thus, the human-system relationship mirrors that between a supervisor and an employee: the supervisor sets directives, which the employee translates into actions before reporting the aggregated results back to the supervisor (Sheridan 1997, 2021). In this sense, supervisory control combines five human functions: (1) task planning, (2) system teaching, (3) system monitoring, (4) corrective intervention, and (5) learning from experience (Sheridan 2021). To date, human factors and ergonomics (HF/E) research has largely focused on system monitoring and corrective intervention.
Within supervisory control research, supervisory process control tasks in industrial contexts became a particular focus early on, as the high complexity and scale of processes created enormous potential for optimisation through system automation (e.g. Johansson 1989; Moray 1997; Woods, O'Brien, and Hanes 1987). However, process automation did not make human control obsolete. As Moray (1997, 1948) noted, 'it is precisely in the flexibility and creativity of the human workforce that the protection against errors and inefficiencies of such technology lies'. Importantly, although the automation of industrial processes continues to advance, especially given the introduction of digital manufacturing technologies in smart factories (Spath and Braun 2021), expectations for human involvement in future industrial processes are consistent with what Moray argued over two decades ago. Humans will continue to play an important role (Neumann et al. 2021; Sony and Naik 2020; van Dyck et al. 2022), with their tasks shifting to supervision in even more areas (Kadir, Broberg, and Conceição 2019; Rauch, Linder, and Dallasega 2020).
A key challenge in supervisory process control tasks is that the cognitive demands on the operator vary significantly depending on the mode of operation. While cognitive demands may be low during routine operation, forcing the operator to monitor passively, abnormal events can lead to spikes in cognitive demands, placing time pressure on the operator to process the situation and choose a response (Endsley 2017; Onnasch et al. 2014). The envisioned solution to this conundrum in HF/E research is the dynamic allocation of tasks between human and system, with the system keeping the operator in the loop as much as possible, but taking on more responsibility when needed (Opperman 1994; Scerbo 1996). However, an effective implementation of dynamic operator assistance requires the system to be capable of assessing the operator's actual demand for more involvement or support. In this regard, physiological measures have become one of the primary approaches to assess the mental state of operators during task execution (Charles and Nixon 2019; Tao et al. 2019).
Accordingly, extensive research efforts have been devoted to the use of physiological measures as indicators of operator mental state in supervisory process control tasks. These efforts, however, are spread across a wide range of physiological measures, application fields, research goals, and mental state dimensions. While this diversity of empirical research provides a rich base from which to draw, it also makes it difficult to link the findings and form a unified body of evidence upon which to build future research. To address this issue, the aim of this scoping review is to map the existing empirical research on physiological measures in supervisory process control settings. In doing so, it seeks to provide guidance to researchers and practitioners by synthesising the evidence base for individual research areas, identifying where robust evidence has been gathered, pointing out present inconsistencies, and highlighting remaining gaps in our understanding that require further investigation.

Humans and automation
Supervisory process control is closely related to many topics in human-automation interaction research, as system automation is the prerequisite for shifting the human role away from manual control (Sheridan 2021).
Here, system automation is defined as allocating activities within a task previously performed by the human to the system (Parsons 1985; Parasuraman and Riley 1997). It must be distinguished from system autonomy, which refers to a system being able to learn and evolve by changing its functional capacities (Hancock 2017a). As long as full system autonomy cannot be realised, as is expected for the foreseeable future (Endsley 2017; Kaber 2018), it is crucial to consider how humans and automation can interact effectively. When this is achieved, automation can empower the human operator to control highly complex systems (Miller and Parasuraman 2007), increasing the overall human-machine system performance while preventing operator overload (e.g. Li et al. 2014; Lorenz et al. 2002; Manzey, Reichenbach, and Onnasch 2012; Reichenbach, Onnasch, and Manzey 2011; Wright, Chen, and Barnes 2018).
Despite these benefits, adverse effects of automation on the human role have also been a focus of past research (Bainbridge 1983). The central concern is that the system does not merely substitute some human activities, but fundamentally changes the human task, reducing human control to the point of passive monitoring (Hancock 2013; Miller and Parasuraman 2007). In fact, passive monitoring has long been shown to impair operator vigilance (Grier et al. 2003; Molloy and Parasuraman 1996; Parasuraman 1979) and situation awareness (Endsley and Kiris 1995; Kaber, Omal, and Endsley 1999; Manzey, Reichenbach, and Onnasch 2012). Moreover, in the event of a system failure, the operator's workload quickly changes from a very low to a very high level (Sheridan 2021). Complicating matters further, while increasing automation reliability reduces the frequency of such incidents, it also reduces the likelihood that the operator will be prepared to intervene when they eventually occur (Endsley 2017; Onnasch et al. 2014). Thus, automation can provide enormous support when all is going well, but if the operator is too far out of the loop, system failures can be even more catastrophic (Onnasch et al. 2014).
Consequently, there has long been a call for human-centered automation (Billings 1991). While human-centered automation can address a variety of system design issues (Sheridan 2021), the term is most often used in the context of task allocation, where subtasks are divided between human and system (Sheridan 1997). Today, task allocation is usually discussed in terms of levels of automation, a concept popularised by Sheridan and Verplank (1978) and extended many times (e.g. Endsley 1987; Endsley and Kaber 1999; Parasuraman, Sheridan, and Wickens 2000; Riley 1989). Whereas early models described a successive transfer of responsibilities from the human to the system with increasing levels of automation, later versions added the notion of different levels of automation for different information processing stages (Endsley and Kaber 1999; Parasuraman, Sheridan, and Wickens 2000). Importantly, even though early research deemed intermediate levels of automation as best suited for the human (Endsley and Kiris 1995), the current understanding is that there is no clear optimum across system applications and usage circumstances (Onnasch et al. 2014).
To deal with varying demands, Rouse (1976) first proposed dynamic operator assistance, i.e. the dynamic reallocation of responsibilities during task execution. Here, two types of implementations can be differentiated: adaptable and adaptive automation (Opperman 1994; Scerbo 1996). While adaptable automation gives the human the authority to change task allocation, an adaptive system makes these decisions independently. Most research has focused on adaptive automation and has successfully demonstrated the possibility of mitigating the discussed pitfalls of automation (see Endsley 2017; Inagaki 2003; Parasuraman and Wickens 2008). However, adaptive automation can also create new problems, such as an operator's insufficient awareness of the current system mode (Sarter and Woods 1995). As a result, some researchers have favoured adaptable automation (Miller et al. 2005; Miller and Parasuraman 2007), a position that appears to be supported by the limited body of empirical evidence (Calhoun 2022). Yet, even if the allocation authority remains with the operator, the system should be able to propose a reallocation, as the operator might not be in the position or state to detect the need (also discussed by Sauer et al. 2011).
The concepts of dynamic operator assistance described above require the system to identify instances that would benefit from a reallocation of responsibility between human and system. To this end, Parasuraman et al. (1992) and Kaber and Endsley (2004) provided early frameworks for classifying potential reallocation triggers, both of which include the assessment of the operator's current state through psychophysiological assessments. The central idea is that by identifying the operator's current ability to cope with task demands, the system could base task allocation on both individual and situational factors. The approach is also consistent with the understanding, assessment, and prediction of operator state as a key driver of past HF/E research.
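The difference between the two implementations can be made concrete with a minimal sketch. It is purely illustrative: the `AllocationMode` enum, the `decide_allocation` function, the scalar `operator_load` estimate, and the 0.8 threshold are hypothetical names and values chosen for this example, not taken from the cited literature.

```python
from enum import Enum


class AllocationMode(Enum):
    ADAPTABLE = "adaptable"  # the operator holds the allocation authority
    ADAPTIVE = "adaptive"    # the system holds the allocation authority


def decide_allocation(mode, operator_load, operator_accepts=False, threshold=0.8):
    """Return (reallocate, message) for one decision cycle.

    operator_load is a hypothetical normalised estimate in [0, 1] of how
    strongly the operator would benefit from additional system support.
    """
    if operator_load < threshold:
        return False, "no reallocation needed"
    if mode is AllocationMode.ADAPTIVE:
        # Adaptive automation: the system decides independently.
        return True, "system reallocated the subtask"
    # Adaptable automation: the system may only propose; the operator decides.
    if operator_accepts:
        return True, "operator accepted the proposed reallocation"
    return False, "reallocation proposed, operator declined"
```

In this sketch, the adaptable branch still lets the system propose a reallocation, which mirrors the point that the operator might not be in a position to detect the need unaided.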

Mental state and physiological measures
Supervisory process control is particularly reliant on cognitive processing. Therefore, this article focuses on the operator's mental state and means to assess it. Corresponding research has typically focused on mental workload (Young et al. 2015), which can be defined as 'the degree of activation of a finite pool of resources, limited in capacity, while cognitively processing a primary task over time mediated by external stochastic environmental and situational factors, as well as affected by definite internal characteristics of a human operator, for coping with static task demands, by devoted effort and attention' (Longo et al. 2022, 18). Research on mental workload can be structured by an explanatory framework by van Acker et al. (2018), which we have adapted for the purposes of this article (see Figure 1). In short, the antecedents of mental workload are cognitive task demands that induce individual-specific levels of workload. In turn, this workload influences measurable human task performance, with detrimental effects that result from either mental under- or overload (Sharples and Megaw 2015). These relationships are moderated by human and system attributes.
Further dimensions of the operator's mental state that have been considered include mental fatigue, stress, and vigilance. Fatigue can be defined as 'a physiological state of reduced mental or physical performance capability resulting from sleep loss, extended wakefulness, circadian phase, and/or workload' (International Civil Aviation Organization 2015). By definition, it thus has an adverse effect on task performance. While there is no universally accepted definition of stress, most accounts focus on the human's appraisal of external demands in the context of the resources available for coping with them (Hancock and Szalma 2006; Hobfoll 1989). Similarly, the relationship between stress and performance is less clear, although usually one similar to that of mental workload is assumed (Hancock and Szalma 2006). Finally, vigilance refers to an operator's alertness to signals over prolonged periods of time (Warm, Matthews, and Finomore 2008). Since research in this area has been driven by interest in the vigilance decrement, i.e. a decreasing signal detection rate over time, it is directly linked to human task performance (Hancock 2013, 2017b; Neigel et al. 2020).
The use of physiological measures to assess these mental state dimensions has expanded (Tao et al. 2019), as the needed sensors have become more accessible and easier to use (Mach et al. 2022; Nixon and Charles 2017). Physiological measures can be categorised into central nervous system (CNS) and peripheral nervous system (PNS) measures (Hancock et al. 2021). CNS measures capture the operator's brain activity, theoretically providing a close link to human cognition. However, the required sensors are often discussed as intrusive and highly susceptible to interference (e.g. Afzal et al. 2022; Gao et al. 2013; Marchitto et al. 2016), which still limits their suitability for real-world settings. PNS measures assess the response of the operator's autonomic nervous system to variations in arousal caused by changes in mental state. Commonly used methods capture the operator's blood pressure, electrocardiac activity, ocular metrics, respiration, or skin-related indicators (Charles and Nixon 2019; Tao et al. 2019). The main advantage of physiological measures is that they can be used continuously with minimal interference with the primary task, making them a suitable option in the context of dynamic operator assistance.
Consequently, the implementation of adaptive automation based on a physiological assessment of the operator's mental state gained early research interest (Freeman et al. 2004; Pope, Bogart, and Bartolome 1995; Prinzel et al. 2003; Wilson and Russell 2003) that continues to this day (e.g. Aricò et al. 2016; Di Flumeri et al. 2019; Ting et al. 2010). In addition, researchers have explored real-time mental state feedback (Gado et al. 2021; Maior, Wilson, and Sharples 2018), which could be used in the context of adaptable automation. However, the objective of reliably using physiological measures to facilitate human-automation interactions by guiding the allocation of responsibilities between human and system is far from attained. In fact, the indirect relationships between cognitive processes and physiological responses, as well as differences in research approaches, have made it difficult to reliably link the existing evidence (Charles and Nixon 2019; Tao et al. 2019). To help future research achieve this goal in the context of supervisory process control tasks, a joint understanding of the current state of research on the use of physiological measures in this application domain is needed.

Research approach and questions
Previous research has generated a large body of empirical evidence on the use of physiological measures of operators' mental states in supervisory process control tasks. However, individual studies naturally focus on a subset of physiological measures used in a particular application field and pursue a precise research goal while investigating selected mental state dimensions, i.e. drawing on selected theoretical frameworks and models from the HF/E literature. The resulting heterogeneity of research approaches creates rich opportunities for initiating and linking further research, but also makes it difficult to identify and consider all relevant findings when developing research questions, designing studies, and interpreting data. To address this issue, the aim of this article is to provide a comprehensive overview of the empirical research on the physiological assessment of operators' mental state in supervisory process control tasks in the form of a scoping review (Arksey and O'Malley 2005). The review seeks to answer the following five research questions:
1. Which physiological measures have been applied?
2. Which application fields have been investigated?
3. Which research goals have been pursued?
4. Which mental state dimensions have been analysed?
5. What empirical evidence has been obtained?
As a scoping review, the goal of the review is not to aggregate empirical evidence for a very specific research question, but to structure and synthesise the existing body of research, including applied research practices (Munn et al. 2018; Xiao and Watson 2019). This structure is intended to guide researchers and practitioners in identifying empirical evidence relevant to their own research. Thereby, we hope to assist researchers investigating a particular physiological measure, application field, research goal, or mental state dimension to draw on evidence from other areas of the vast supervisory process control literature. Furthermore, the review highlights existing foci and gaps in previous research, showing where substantial evidence has been obtained and where additional research is needed. In this way, the article contributes to the development of a unified body of evidence for the use of physiological measures as a necessary basis for their reliable application in guiding human-automation interaction.

Review protocol and eligibility criteria
The review protocol was developed based on the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR; Tricco et al. 2018) and is provided in the supplementary materials. To identify articles for the outlined research questions, the following four eligibility criteria were selected. Articles were required to report:
1. empirical research involving either original or existing data sets.
2. the investigation of human operators in supervisory process control tasks. The review adopted a broad definition of supervisory process control including both typical continuous process control tasks, such as industrial process control and power plant operation, and discontinuous process control tasks, such as traffic control and the supervision of automated agents. Not considered were supervisory control tasks where the primary task is the direct spatial control of a vehicle (e.g. car or plane) or another type of technical system (e.g. robot or crane), as they pose different demands on the operator in manual control and take-over scenarios compared to the supervisory process control tasks of interest.
3. the application of at least one physiological measure that quantifies the response of the operator's PNS during the execution of the task. Due to their limitations in real-world settings, CNS measures were outside the scope of this review. Also not considered were ocular metrics that analyse the operator's gaze pattern across predefined areas of interest (AOIs), as AOIs are domain specific and thus lack comparability between the different application fields included in this review.
4. an objective reference for the mental state assessment, by specifying either potential differences in cognitive demands or task performance measures as antecedents and consequences of mental state variations, respectively.
In addition to the eligibility criteria, articles were filtered based on four quality criteria to support the reliability of the data synthesis. Articles were required to report:
1. research findings that were not already the subject of another included article, excluding duplicate publications of the same empirical results.
2. descriptions of the methods and results including details on the sample, the experimental task, the experimental procedure, the experimental design, the obtained data, and the conducted statistical analyses.
3. an analysis of the results of physiological measures that can be interpreted independently from other variables. This criterion is violated, e.g. when physiological variables are included together with other predictors in a single model, preventing insights into the individual contributions of the predictors.
4. empirical research conducted with at least ten participants when using a convenience sample. Smaller expert samples were included, as the investigation of domain experts can provide important additional insights due to the higher level of ecological validity.
Eligibility and quality criteria were applied to articles published in peer-reviewed journals and conference proceedings since 2000, written in English. The review thus focuses on the current state of research that is part of the scientific discourse.
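The eligibility and quality criteria above can be read as a conjunction of checks on each candidate article. The following is a minimal sketch of that reading, not the screening procedure actually used: the `Article` record and all of its field names are hypothetical, and the minimum sample size is applied only to convenience samples, which is an assumption about how the fourth quality criterion treats trainee and mixed samples.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # All field names are hypothetical and only serve this illustration.
    year: int
    language: str
    peer_reviewed: bool
    empirical: bool
    supervisory_process_control: bool
    uses_pns_measure: bool
    has_objective_reference: bool
    duplicate_results: bool
    methods_and_results_detailed: bool
    physio_interpretable_alone: bool
    sample_type: str  # "convenience", "trainee", "expert", or "mixed"
    n_participants: int


def passes_screening(a: Article) -> bool:
    """Apply the publication filter, eligibility criteria and quality criteria."""
    publication_ok = a.year >= 2000 and a.language == "English" and a.peer_reviewed
    eligible = (
        a.empirical
        and a.supervisory_process_control
        and a.uses_pns_measure
        and a.has_objective_reference
    )
    # Assumption: the ten-participant minimum applies to convenience samples only.
    sample_ok = a.sample_type != "convenience" or a.n_participants >= 10
    quality_ok = (
        not a.duplicate_results
        and a.methods_and_results_detailed
        and a.physio_interpretable_alone
        and sample_ok
    )
    return publication_ok and eligible and quality_ok
```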

Search strategy and selection of entries
The systematic literature search included four databases: Scopus, Web of Science, PsycInfo, and PSYNDEXplus. The search strategy combined search terms from the two categories of physiological measures and supervisory process control tasks that were searched for in the title, abstract, and keywords of the database records. The search terms for physiological measures included both instruments and metrics that were drawn from the review on physiological measures of mental workload by Charles and Nixon (2019). The search terms for supervisory process control tasks were developed in two steps. A first database search combined the search terms for physiological measures with the terms up to the inner parenthesis, i.e. control centre to traffic control, covering a range of task descriptions and operation environments as well as the two major application fields in the process control literature, (nuclear) power plants and (air) traffic control. After removing duplicates, the titles and abstracts of the obtained records were screened for eligibility.
In a next step, application fields that were reported in more than one article were added to the supervisory process control task search terms for a second database search, which excluded all records already identified in the first search. After repeating the removal of duplicates and the screening of titles and abstracts for the records of the second search, the full texts for all potentially eligible articles were retrieved and assessed based on the eligibility and quality criteria. Finally, the references of the eligible articles were screened to identify relevant articles missed in the database searches. The original database exports were performed on 4 June 2022, and then updated on 27 October 2023, as part of the article revision.
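The bookkeeping behind the two-step search (removing duplicates within a search and excluding records already found in the first search) can be sketched as follows. This is an illustrative assumption about record handling, not the tooling used for the review: the dictionary layout with `doi` and `title` keys and the matching rule (DOI if present, otherwise a normalised title) are hypothetical.

```python
def record_key(record):
    """Identify a record by DOI if present, otherwise by its normalised title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return doi
    return " ".join(record["title"].lower().split())


def deduplicate(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def new_in_second_search(first_search, second_search):
    """Return deduplicated second-search records not identified in the first search."""
    known = {record_key(r) for r in first_search}
    return [r for r in deduplicate(second_search) if record_key(r) not in known]
```

A limitation of this simple rule is that the same article exported with a DOI from one database and without one from another would not be matched, so manual checks would still be needed.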

Data charting and synthesis
First, experiment type and sample information, including type and size, were extracted. Experiment type was categorised into laboratory and field experiments, whereby laboratory experiments were further divided into simulation and single-trial experiments. In simulation experiments, participants interacted with a continuous simulation of the process, whereas single-trial experiments involved separate task stimuli that either were static or displayed very brief process sequences. The samples were categorised into expert, trainee, and convenience samples. Expert samples included participants who either worked as operators in the investigated application field or had done so in the past, whereas trainee samples included participants who were undergoing vocational or academic training in the application field. For the sample size, the number of participants included in the analysis of physiological measures was extracted. In the case of between-subject designs, the smallest sample size across conditions was also extracted. When subsample sizes were not reported, they were estimated based on the total sample size and the number of conditions. Whether multiple participants worked as a team during the study was also extracted.
Second, to answer research questions (1) and (2), the applied physiological measures and the application field were extracted. Application fields were classified as either continuous or discontinuous control processes. Third, the studied mental state dimensions were collected as reported in the articles to address research question (4). Fourth, for research question (3), the analysis designs for the physiological measures were extracted. The included variables were grouped and assigned to the three research goals of validation, evaluation, and prediction. Both validation and evaluation refer to designs that treat physiological measures as outcome variables. Whereas validation research investigates whether physiological measures are sensitive to cognitive demands (over time), evaluation analyses assume the validity of physiological measures to investigate the effect of other variables on the operator's mental state. In contrast, prediction refers to research that uses physiological measures as predictors for human task performance. Fifth, the reported results for physiological measures and any analyses linking them to performance measures were extracted for research question (5).
For data synthesis, the study characteristics and results for research questions (1) to (4) are presented with simple descriptive statistics that illustrate the frequency of the approaches and topics in existing research. To provide an accessible overview of the extent to which different physiological measures, application fields, and research goals are combined in the literature, evidence maps were created that highlight research foci (see Miake-Lye et al. 2016). Finally, research question (5) is addressed by structuring the relevant references and summarising the key findings.
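The assignment of an analysis design to one of the three research goals follows from the role of the physiological measure and the variable it is related to. A minimal sketch of this decision rule is given below; the function name and the string labels are hypothetical and were chosen for this example only.

```python
def classify_research_goal(physio_role, related_variable):
    """Map an analysis design to validation, evaluation or prediction.

    physio_role: "outcome" or "predictor" (role of the physiological measure).
    related_variable: e.g. "cognitive_demand", "time_on_task",
        "task_performance", or any other label for a further variable.
    """
    if physio_role == "predictor" and related_variable == "task_performance":
        # Physiological measures used to predict human task performance.
        return "prediction"
    if physio_role == "outcome":
        if related_variable in {"cognitive_demand", "time_on_task"}:
            # Is the measure sensitive to cognitive demands (over time)?
            return "validation"
        # Validity is assumed; the effect of another variable is studied.
        return "evaluation"
    return "unclassified"
```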
Here, the large heterogeneity of experimental designs and data analysis approaches expected in a scoping review makes the application of quantitative synthesis methods less meaningful.

Study characteristics
The study selection process is illustrated in Figure 2. The process resulted in 106 eligible articles, of which 3 were published from 2000 to 2004, 11 from 2005 to 2009, 20 from 2010 to 2014, 40 from 2015 to 2019, and 32 since the beginning of 2020, showing a steady increase in research interest. All articles report the results of a single study except two, which report two (Schmitz-Hübsch, Stasch, and Fuchs 2021) and three (Vogt, Hagemann, and Kastner 2006) studies, respectively. The articles by Fallahi, Motamedzade, Heidarimoghadam, Soltanian, and Miyake (2016) and Fallahi, Motamedzade, Heidarimoghadam, Soltanian, Farhadian, et al. (2016) as well as by Yan, Tran, Chen, et al. (2017) and Yan, Tran, Habiyaremye, et al. (2017) report separate results from the same two studies. For clear referencing, these are presented separately, resulting in 109 studies included in the data synthesis.
The study characteristics and extracted features of the 109 studies are presented in Table 1. The vast majority of studies (102) were conducted in a laboratory setting, of which 12 used a single-trial design and 90 a continuous simulation. Six studies were conducted in the field, and one study combined measurements in both laboratory and field settings. Most studies (62) were conducted with a convenience sample, 8 with a trainee sample, 34 with an expert sample, and 5 with a mixed sample. Of the latter, 3 studies combined laypersons and experts and 2 included both trainees and experts. The sample size of convenience samples ranged from 10 to 152 (median (Mdn) = 22), of trainee samples from 10 to 47 (Mdn = 20.50), and of expert samples from 1 to 41 (Mdn = 16). Studies that included both laypersons and experts each had a sample size of 39, and the studies with trainees and experts had sample sizes of 10 and 14, respectively. Eight studies used a team setting where multiple participants worked together.
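As a quick consistency check, the reported counts can be tallied to confirm that the breakdowns by publication period, experiment type, and sample type add up to the stated totals. The snippet below only restates numbers from the text; the variable names are illustrative.

```python
# Articles by publication period, as reported in the text.
articles_by_period = {
    "2000-2004": 3, "2005-2009": 11, "2010-2014": 20,
    "2015-2019": 40, "2020 onwards": 32,
}
n_articles = sum(articles_by_period.values())

# One article reports two studies and one reports three studies.
extra_studies = (2 - 1) + (3 - 1)
n_studies = n_articles + extra_studies

studies_by_setting = {
    "laboratory, single-trial": 12, "laboratory, simulation": 90,
    "field": 6, "laboratory and field": 1,
}
studies_by_sample = {"convenience": 62, "trainee": 8, "expert": 34, "mixed": 5}

assert sum(studies_by_setting.values()) == n_studies
assert sum(studies_by_sample.values()) == n_studies
print(n_articles, n_studies)  # 106 109
```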

Physiological measures
As a first step to guide researchers in selecting physiological measures, we provide an overview of the types of metrics that have been explored so far. Physiological measures used in supervisory process control studies primarily include blood pressure, electrocardiac, ocular, respiratory, and skin measures. Ocular measures are the most popular approach with 71 studies, followed by electrocardiac measures with 49, skin measures with 10, respiratory measures with 7, and blood pressure measures with 6 studies. In addition, muscle tension measures were used in 2 studies and head position and facial movement in one study each. These are summarised as other measures in the following overviews. Only one type of physiological measure was used in 84 studies, two types in 18, and three types in 4. One study each used four, five, and six types of physiological measures. Of note, all but a single study used at least one of the two most common approaches, ocular and electrocardiac measures.
For all types of physiological measures, different metrics can be distinguished. Blood pressure measures include the systolic and the diastolic pressure. Electrocardiac measures include the heart rate (HR) and the heart rate variability (HRV), which refers to the fluctuations of time intervals between consecutive heartbeats. Metrics corresponding to HR are also reported as the mean interbeat interval (IBI) or the average of RR intervals (AVNN). HRV metrics can be further divided into time-domain and frequency-domain indices (Shaffer and Ginsberg 2017). In this review, time-domain indices included the standard deviation of RR intervals (SDRR/SDNN), the coefficient of variation in RR intervals (CVRR; Jeroski et al. 2014), the root mean square of successive RR interval differences (RMSSD), the percentage of successive RR intervals that differ by more than X ms (pNNX), as well as the median absolute deviation (MAD), skewness and kurtosis of the IBI distribution. Frequency-domain measures included the absolute and relative power of the low-frequency (LF), mid-frequency (MF), and high-frequency (HF) bands, as well as the total power (TP). One study also analysed the frequency and duration of non-stationary HRV phases.
Ocular measures related to pupil size, blinks, fixations, saccades, or gaze patterns. For pupil size, metrics included the mean, standard deviation, maximum, total energy, fractal dimension and latency of the maximal pupil response. Also used were pupil eccentricity and pupil velocity. For blinks, fixations, and saccades, the most popular metrics were number, rate and duration, for which sum, mean and standard deviations were calculated. Saccade metrics also included amplitude, velocity, peak velocity, gain, latency, intersaccadic drift mean velocity, and intersaccadic duration. Gaze patterns were quantified using gaze density, (conditional) gaze entropy, gaze transition rate, gaze transition entropy, fixation dispersion, the Nearest Neighbour Index (Di Nocera, Camilli, and Terenzi 2007), and the Coefficient K (Krejtz et al. 2016). One study also applied electrooculography. Skin measures included skin potential, skin conductance level (tonic) and response (phasic), skin temperature and skin blood flow, whereas respiration was quantified by either rate or amplitude. Muscle tension was assessed based on the amplitude of an electromyography signal, and image recognition was used for both head position and facial movement analyses.
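The two sets of counts in this paragraph can be cross-checked against each other: summing the studies per measure type should give the same number of measure-type uses as weighting the studies by how many types they combined. The short tally below restates the reported numbers under the assumption that each listed measure (including muscle tension, head position, and facial movement) counts as one type; the variable names are illustrative.

```python
studies_per_measure_type = {
    "ocular": 71, "electrocardiac": 49, "skin": 10, "respiratory": 7,
    "blood pressure": 6, "muscle tension": 2, "head position": 1,
    "facial movement": 1,
}
# Number of studies that combined k types of physiological measures.
studies_by_number_of_types = {1: 84, 2: 18, 3: 4, 4: 1, 5: 1, 6: 1}

uses_from_types = sum(studies_per_measure_type.values())
uses_from_counts = sum(k * n for k, n in studies_by_number_of_types.items())
n_studies = sum(studies_by_number_of_types.values())

print(uses_from_types, uses_from_counts, n_studies)  # 147 147 109
print(round(uses_from_counts / n_studies, 2))        # 1.35 types per study
```

Under this assumption both tallies agree, which corresponds to an average of about 1.35 measure types per study.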
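To make the time-domain indices named above concrete, the following minimal sketch computes them from a short series of RR intervals using only the Python standard library. It illustrates the common textbook definitions, not the processing pipeline of any reviewed study; the function name, the dictionary keys, the default threshold of 50 ms for pNNX, the use of the sample standard deviation, and the approximation of HR from the mean RR interval are assumptions of this example.

```python
import math
import statistics


def hrv_time_domain(rr_ms, x_ms=50.0):
    """Compute time-domain HRV indices from RR intervals given in milliseconds."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    mean_rr = statistics.fmean(rr_ms)   # AVNN, i.e. the mean interbeat interval
    sdrr = statistics.stdev(rr_ms)      # sample standard deviation (n - 1)
    # Differences between successive RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(statistics.fmean([d * d for d in diffs]))
    # Share of successive differences that exceed x_ms in absolute value.
    pnnx = 100.0 * sum(abs(d) > x_ms for d in diffs) / len(diffs)
    median_rr = statistics.median(rr_ms)
    mad = statistics.median([abs(r - median_rr) for r in rr_ms])
    return {
        "HR_bpm": 60000.0 / mean_rr,    # approximation based on the mean RR interval
        "AVNN_ms": mean_rr,
        "SDRR_ms": sdrr,
        "CVRR_pct": 100.0 * sdrr / mean_rr,
        "RMSSD_ms": rmssd,
        "pNNX_pct": pnnx,
        "MAD_ms": mad,
    }


metrics = hrv_time_domain([800, 810, 790, 850, 800])
print(metrics["AVNN_ms"], metrics["pNNX_pct"], metrics["MAD_ms"])  # 810.0 25.0 10
```

In practice, RR series are usually cleaned of artefacts and ectopic beats before such indices are computed, a step this sketch omits.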
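The event-based metrics mentioned for blinks, fixations, and saccades (number, rate, and the sum, mean, and standard deviation of durations) can be derived from a list of detected events. The sketch below assumes that events are already available as hypothetical `(start_s, end_s)` tuples in seconds; event detection itself, which differs between eye trackers, is not covered, and the function and key names are illustrative.

```python
import statistics


def summarise_events(events, recording_s):
    """Summarise ocular events given as (start_s, end_s) tuples in seconds."""
    if recording_s <= 0:
        raise ValueError("recording duration must be positive")
    durations = [end - start for start, end in events]
    n = len(durations)
    return {
        "number": n,
        "rate_per_min": 60.0 * n / recording_s,
        "duration_sum_s": sum(durations),
        # Mean and SD are undefined for too few events; report NaN instead.
        "duration_mean_s": statistics.fmean(durations) if n >= 1 else float("nan"),
        "duration_sd_s": statistics.stdev(durations) if n >= 2 else float("nan"),
    }


# Three hypothetical blinks within a 60-second recording.
blinks = [(1.0, 1.2), (5.0, 5.4), (9.0, 9.3)]
summary = summarise_events(blinks, recording_s=60.0)
```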

Application fields
Next, we present a summary of the studied application fields to assist researchers in identifying application contexts from which to draw insights. Figure 3(a) presents the connection between application fields and physiological measures across studies. For 42 studies, the application field was categorised as a continuous control process, including nuclear power plant operation (28), industrial process control (8), cabin air management (4), and electricity network control (2). For nuclear power plant operation, the majority of studies focused on emergency operation, with additional studies investigating system start, reset, and shutdown, or specific subtasks, such as monitoring or accident diagnosis. Industrial process control studies mainly examined monitoring behaviour, with fewer studies focusing on system control and emergency operation. The rest of the continuous process control studies investigated the full control task for stabilising air parameters in a spacecraft life support system and managing electrical power distribution.
The application fields of the remaining 67 studies were categorised as discontinuous control processes, including air traffic control (39), control of automated agents (16), combat control (4), rail control (3), city traffic control (2), vessel traffic service (1), digital manufacturing (1), and real-time strategy gaming (1). For air traffic control, most studies investigated the full control task of either terminal or en-route controllers, while the rest analysed subtasks, such as conflict detection, situation assessment, tracking, and visual search. The vast majority of studies on automated agents examined the supervision of unmanned aerial vehicles. Only one study each investigated the supervision of a combination of unmanned ground and aerial vehicles as well as firefighter robots. Studies on combat control, city traffic control, rail control, and vessel traffic service examined full control tasks, whereas the study on digital manufacturing focused on the subtask of defect detection.

Research goals
To provide further insight into the structure of the body of evidence and to highlight current research foci, we synthesise the research goals of the identified studies. Figure 3(b) presents the connection between research goals and physiological measures across studies. In total, 78 studies assessed the validity of physiological measures, 56 studies used them for evaluation, and 31 studies examined their relationship with task performance, whereby 58 studies investigated one of the research goals, 46 two, and 5 all three. Among validation studies, the majority (55) investigated the sensitivity of physiological measures to changes in task demands. Task demand variables included the number of units to be monitored or controlled, required input frequency, time pressure, number of parallel tasks, task complexity, and task familiarity. Event demands were defined as a separate group of predictor variables, including the presence, number, and frequency of abnormal events as well as the unexpected addition of subtasks. Such predictor variables were used in 19 studies. Five studies investigated task-external demands, including social stressors, visual stressors, and noise. The examination of the relationship between physiological measures and time on task was also included in validation research. This included 16 studies, which examined between 2 and 22 (Mdn = 4) time intervals defined by either duration or task block.
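The study counts reported for the research goals can be cross-checked with simple arithmetic that uses only the figures given in the text. The number of studies pursuing one, two, or three goals should sum to the 109 included studies, and the implied number of study-goal pairings should equal the sum of the three goal counts:

```latex
\begin{align*}
58 + 46 + 5 &= 109 && \text{(included studies)} \\
58 \cdot 1 + 46 \cdot 2 + 5 \cdot 3 &= 165 && \text{(study-goal pairings)} \\
78 + 56 + 31 &= 165 && \text{(validation + evaluation + prediction)}
\end{align*}
```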
Among evaluation studies, 20 investigated interface designs, 15 task formats, 14 automation levels, and 11 human factors. Interface design variables compared visualisation concepts, input devices, and alarm designs, whereas task format variables separated control procedures, control procedure steps, subtasks, shifts, fault scenarios, and operator roles. The evaluation of automation design focused on different automation levels, automation for different processing stages, options to switch between automation levels, and automation usefulness. The examined human factor variables were either assessed a priori, including age, experience, and skills, or designed as post-hoc classifications of participants based on either performance variability or cue utilisation. Finally, of the 31 studies analysing the relationship between physiological measures and performance, 23 studies used standard correlation approaches. Of these, 15 conducted bivariate correlations, 6 multiple regression analyses with one type of physiological measure, and 2 multiple regression analyses with multiple types of physiological measures. The five remaining approaches were the comparison of physiological measures between periods of low and high performance (4), dividing participants based on physiological measures to compare their performance (1) and vice versa (1), recurrent neural network modelling (1), as well as fuzzy inference modelling (1).
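Of the analysis approaches listed above, the bivariate correlation is the simplest to illustrate. The following is a minimal sketch of a Pearson correlation between one physiological metric and one performance score, assuming one value pair per participant or time window. The function name `pearson_r` and the example values are hypothetical, and the sketch does not reproduce the procedure of any particular reviewed study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length (at least 3)")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)

    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up values: mean pupil size (mm) and task score per participant
    pupil_size = [3.1, 3.4, 3.0, 3.8, 3.6]
    task_score = [62, 70, 58, 81, 77]
    print(pearson_r(pupil_size, task_score))
```

A significance test would additionally require the sample size and a reference distribution, which is left out here to keep the sketch free of external dependencies.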

Mental state dimensions
Next, we summarise which models of operator mental states were used to develop research questions and interpret empirical data. The respective mental state dimensions were collected as reported in the articles, though not all articles clearly indicated the studied dimension. Mental workload (83), including synonyms such as cognitive workload, subordinate concepts such as memory load, and antecedents such as task complexity, was investigated most often and in relation to all types of predictors except task-external demands. Mental fatigue (5) and vigilance (8) were always investigated in relation to time on task, sometimes in addition to task demands (vigilance was also investigated for the human factor cue utilisation). The difference between study designs for the two dimensions was that mental fatigue was usually studied with high task demands, whereas vigilance was studied with low task demands. All studies that examined task-external demands reported the investigation of stress. However, this was also the case for some studies using task demands, event demands, interface design, and task format variables, making the distinction between stress (10) and mental workload rather ambiguous. Also reported was the analysis of situation awareness (Endsley 2021) as affected by interface design or predicted by gaze dispersion, as well as of arousal (1) as a correlate of task performance. Finally, two studies examined the higher-level concepts operator functional state (Hockey 2003) and cognitive readiness (Morrison and Fletcher 2002) in relation to task demands.

Empirical evidence
Having thoroughly mapped the available research, we conclude the review with the concrete empirical evidence obtained for individual physiological measures, providing insight into their observed reliability as a basis for their selection in future studies. The evidence is presented for the individual types of measures and structured following the research goals of validation, evaluation, and prediction. The summary focuses on results that were verified using inferential statistical analyses. Correspondingly, the reported presence or absence of an effect refers to the results of the inferential statistical analysis that estimated the studied effect to be significant or non-significant. In a few instances, the articles gave conflicting information on statistical results, e.g. between tables and the full text. Such analyses are omitted.

Validation
Table 2 presents the list of references for the empirical validation of physiological measures, for all measures that were successfully validated in at least one study. Studies are classified as either supporting the expected relationships between cognitive demands and physiological measures, or yielding either no effect or an effect contrary to expectation. For a quick overview, a visual summary of the respective number of studies is presented in Figure 4. The limited amount of research on blood pressure measures shows some support for increasing systolic pressure for higher task demands. However, this finding is only supported by half of the respective studies. Substantial evidence on electrocardiac measures indicates an increase of HR and a decrease of HRV with higher cognitive demands and a reverse pattern with time on task. These well-established effects are, however, far from being supported in every study, and there are also some studies showing reversed effects. The largest amount of empirical evidence has been collected for pupil size assessments, which show a positive relationship with cognitive demands and a negative relationship with time on task. Other pupil-related metrics, such as the fractal dimension of the pupil size time series, have gained empirical support in the few studies that applied them.
Measures relating to eye blinks, fixations, and saccades resulted in a very diverse set of empirical evidence. Although there is some support for fewer and shorter blinks, more and longer fixations, and more and longer saccades with higher cognitive demands, the conflicting evidence highlights the dependencies of these measures on the specific task and human-system interface. For gaze pattern analyses, several metrics have been explored, yet there is still comparatively little evidence for the individual options. This is also the case for respiratory and skin measures. The few studies that applied these methods reported an increase in respiration rate and amplitude as well as increases in skin conductance level and skin blood flow with higher cognitive demands. Overall, a large number of studies reported significant associations between cognitive demands and physiological measures, supporting their use in assessing operators' mental state. However, the non-negligible number of studies that do not support or even contradict these findings creates a heterogeneous body of evidence.

Evaluation
Table 3 presents the list of references for studies using physiological measures to evaluate the effect of other variables on the operator's mental state. Studies are classified according to whether or not the physiological measures showed a significant effect of the respective predictor. In contrast to validation studies, where the effects of different cognitive demands on the user's mental state are strictly assumed, in evaluation research the relationship between predictor variables and mental state is only hypothesised. Thus, if a physiological measure does not show a significant effect, this may be because either the measure is insensitive to variations in mental state or the predictor variable actually has no effect on the operator. As a result, the empirical evidence collected in these types of studies requires a much more in-depth investigation that is beyond the scope of this review. The list of referenced articles, however, includes a rich body of research that should be consulted when examining a particular physiological measure.

Prediction
Table 4 presents the list of references for studies linking physiological and performance measures. Studies that found significant relationships between the two for some, but not all, performance metrics are included in both columns. Although the results are similarly diverse to the validation and evaluation results, and the analytical approach of some studies makes interpretation of the reported relationships difficult (e.g. Hwang et al. 2008; John et al. 2022), there appears to be a noteworthy trend in the data. Physiological responses that have been shown to be associated with higher cognitive demands are also associated with better task performance. Specifically, studies have shown higher performance to be associated with higher HR (Hettiarachchi et al. 2021; Peifer, Sauer, and Antoni 2020), lower HRV (Jeroski et al. 2014), larger pupil size (Coyne, Foroughi, et al. 2017; Knisely et al. 2020; McIntire, McIntire, et al. 2014; Taylor 2015), as well as fewer and shorter eye blinks (Chen, Yan, and Tran 2019; McIntire, McKinley, et al. 2014). This is notable since many validation studies report lower task performance under conditions with higher cognitive demands, which may lead to the assumption of an opposing association between physiological measures and performance.
There are also some studies not referenced in Table 4, as they did not report significance levels for individual predictors. C. Shi and Rothrock (2022) used fixation- and saccade-related predictors in a logistic regression model on error rate. Xiong et al. (2023) used HR, eye closure, and head pose as inputs for a recurrent neural network to predict operator performance, whereas Hwang et al. (2009) used HRV metrics as inputs for a fuzzy inference model to predict team performance. Xinyao et al. (2020) split participants based on gaze dispersion, showing that the group with higher gaze dispersion made significantly fewer errors, while Sibley et al. (2015) reported descriptively that the standard deviation of the pupil size was smaller for the four top performing participants compared to the four lowest performing participants. Furthermore, Chamberland et al. (2018) showed larger increases in pupil size following warnings that were detected compared to warnings that were missed, and Bhavsar, Srinivasan, and Srinivasan (2016) reported descriptive differences in the pupil size time series after a critical event between successful and unsuccessful event handling. Finally, Schmitz-Hübsch, Stasch, and Fuchs (2021) (1, 2) analysed the proportion of participants that showed differences in physiological measures between periods of low and high performance.
Overall, the majority of studies identified significant relationships between physiological measures and human performance, but the small number of studies relative to the range of different analysis approaches makes it difficult to identify unambiguous effect patterns.

Discussion
Over the past decades, research on the physiological assessment of operators' mental state has steadily increased. The goal of this review was to collect, structure, and synthesise the existing empirical research to provide a foundation for future research on human-centered automation solutions that are enabled by physiological assessments. To this end, the review provides researchers with an analysis of (1) which types of physiological measures and specific metrics have been applied, (2) which application fields and corresponding operator tasks have been investigated, (3) which research goals have been pursued and which variables have been studied, (4) which mental state dimensions have been considered, and (5) what empirical evidence has been obtained.

Primary findings
The collection of PNS measures applied in supervisory process control tasks is consistent with those in other application areas (Charles and Nixon 2019; Pagnotta et al. 2021; Tao et al. 2019), with ocular and electrocardiac measures being the most prevalent and fewer studies applying blood pressure, respiratory, and skin measures. However, in contrast to previous reviews, ocular measures were found to be more common than electrocardiac measures, which matches the focus on visual processing requirements in most supervisory process control settings. Air traffic control and nuclear power plant operations continue to be the most studied application fields due to the high requirements for human performance to ensure system stability and safety. In addition, a range of continuous and discontinuous process applications have been covered in the literature. Of particular note is the emerging field of supervising automated agents, which ties back to the beginnings of supervisory control research, but transfers it from human-robot to human-multi-robot applications.
Most research focuses on the validation of physiological measures or uses them for evaluation. The fact that both research goals are being prominently pursued during the same time period illustrates, on the one hand, the substantial body of knowledge that has already been accumulated and, on the other hand, the still present need for additional insights into the relationship between the mental state and physiological responses. Furthermore, even in studies in which associations between physiological measures and performance are reported, performance prediction is often not part of the main research questions. Regarding mental state dimensions, the amount of corresponding research identifies mental workload as the central conceptual framework to be considered in supervisory process control. Mental fatigue and vigilance add important dimensions, as they highlight the interaction between cognitive demands and time on task. In contrast, stress was not well distinguished from mental workload in the applied experimental designs. Although situation awareness plays an important role in general human-automation research, the link to physiological responses is less clear and has therefore not been a focus of this particular research approach.
The collected empirical evidence on the relation between operators' mental state and physiological measures was consistent with previous review articles (Charles and Nixon 2019; Pagnotta et al. 2021; Tao et al. 2019). The large number of studies finding significant associations underscores the potential of using physiological measures in supervisory process control tasks, with pupil size measures yielding the most reliable empirical results. However, as with previous reviews, the overall evidence base is heterogeneous. One reason for this is the difference in research designs between studies, which hinders the synthesis of the findings. This includes studies reporting the same variables but manipulating them in different ways, e.g. task complexity relating to the number of control units or the frequency of incidents, or time on task relating to different time scales. Moreover, studies examining the same physiological response can quantify it differently, as exemplified by the range of HRV indices. Other problems include the fact that some articles do not provide precise information on studied tasks and conditions (see also Pagnotta et al. 2021) or that experimental designs are used that confound relevant variables, such as task demands and time on task.
In addition to these general issues, there are also inherent characteristics of physiological measures that may contribute to the disparity in empirical results. Besides variations in the operator's mental state, physiological responses are influenced by other variables, such as physical workload and environmental factors including temperature, noise, and illumination levels (Sharples and Megaw 2015). This, along with differences in their physiological basis, may also cause results to differ not only for one measure type from study to study, but also between different measure types in the same study (e.g. Bernhardt et al. 2019; Matthews et al. 2015). Moreover, the complex and non-linear relationships between cognitive demands and dimensions of the operator's mental state complicate the validation of physiological measures. The resulting lack of a reference for validation is further exacerbated by the well-known problem of insensitivities and dissociations between all three types of mental state measures, i.e. performance measures, subjective measures, and physiological measures (Hancock 2017c; Hancock and Matthews 2019). Focusing on the results of prediction research, the observation that congruent physiological responses are often associated with higher cognitive task demands and better human performance highlights another issue in data interpretation. It is important to consider that not only high but also low demands can impair performance. This is emphasised by the fact that some of the studies in which the described relationship between physiological measures and performance was found examined vigilance tasks that impose comparatively low cognitive demands (Jeroski et al. 2014; McIntire, McIntire, et al. 2014; McIntire, McKinley, Goodyear, et al. 2014). Furthermore, research suggests that physiological measures are better understood as correlates of task engagement (e.g. Hopstaken et al. 2015a, 2015b) and invested effort (e.g. van der Wel and van Steenbergen 2018; Da Silva Castanheira, LoParco, and Otto 2021) rather than a simple subjective reflection of cognitive task demands. As a result, the confounding effect of higher task demands leading to both an increase in the effort level required to maintain task performance and an increase in the effort actually expended by the human can obscure the relationship between physiological measures and performance.
All of these issues add to the continuing difficulty in synthesising the empirical evidence on physiological measures as the basis for their effective application.

Opportunities for future research
In both validation and evaluation studies, physiological outcomes are validated or assessed against the cognitive task demands as predictors. A fundamental flaw of this approach is that these predictors cannot account for the variance that is introduced by each operator's individual characteristics (see human moderators in Figure 1). While this is consistent with the goal of evaluation studies that examine system features and general principles across operators, it falls short in capturing the mental state as an individual-specific concept and the physiological response as an individual-specific outcome. However, this is exactly what is required for dynamic operator assistance, where the allocation of responsibility should depend on the state of the individual operator and their ability to cope with current cognitive demands. In other words, if physiological metrics are to be used to trigger task redistribution in human-automation interaction, they must be sensitive to the mental state of the operator rather than to cognitive task demands, and should thus be validated accordingly. Consistent with this line of reasoning, a growing number of researchers have recently advocated using physiological measures as predictors, with task performance measures as outcomes (Hancock et al. 2021; Longo et al. 2022; Wickens 2017). There are two main advantages of this approach. First, human performance comes closest to an objective ground truth when analysing the operator's mental state. Second, it is the most appropriate criterion to optimise for, since dynamic operator assistance aims to ensure system stability and safety by avoiding task conditions that impair the operator's ability to perform the task. The approach also raises important research questions, such as on (1) the shape of the relationship between physiological measures and performance, considering the nonlinear relationship of relevant mental state dimensions with performance, (2) the temporal correspondence between variations in specific physiological measures and performance, as well as (3) the relative ability of cognitive task demands and physiological measures to predict performance. As this review has shown that respective research designs have played only a minor role in the supervisory process control literature, this approach offers significant potential for future advancement.

Limitations and scope
The research synthesis is determined by the prior selection of research articles. Defining the scope of the review too narrowly leads to the exclusion of relevant evidence, whereas defining the scope too broadly compromises the effectiveness of the synthesis. The balance chosen for this review was to include a wide range of supervisory process control tasks to accumulate evidence across application fields, while excluding other domains of human supervisory control, such as aircraft control and autonomous driving. The review also included studies that pursued different research goals and examined different mental state dimensions, resulting in a broader scope than most previous reviews. The review can therefore provide a comprehensive synthesis to a certain level of detail, but it cannot present all potentially relevant details of the individual studies in an in-depth narrative synthesis or aggregate the empirical findings in a meta-analytical review. At the same time, the insights consolidated in this review can be extended by findings from other application domains, considering domain-specific differences.

Conclusion
This review provides researchers with a comprehensive overview of the empirical research on the physiological assessment of operators' mental state in supervisory process control tasks. The synthesis identified the use of a wide range of physiological measures, highlighting a focus on ocular and electrocardiac measures. These methods have been applied in a similarly wide range of application fields, spanning both continuous and discontinuous control environments. Most of the relevant research has been aimed at the validation of physiological measures to assess the effects of cognitive task demands on the operator, providing rich empirical support for their successful application, particularly for pupil size measures. Physiological measures have also been used in evaluation studies and, to a lesser extent, as predictors of human task performance. In terms of mental state dimensions, most researchers have focused on the analysis of operators' mental workload. Despite the identified empirical support for physiological measures, the aggregated evidence also reveals considerable heterogeneity in the results of the included studies. These discrepancies require further investigation to ensure that physiological measures can make reliable contributions in real-world applications. To this end, it is expected that the current state of knowledge can be advanced by placing greater emphasis on the study of physiological measures as predictors of human task performance as an outcome.

Figure 2. Flow diagram of the literature search.

Figure 3. Evidence maps for the physiological assessment of operators' mental state in supervisory process control tasks. The bubble plots show the number of studies that applied different types of physiological measures, differentiated by (a) application field and (b) research goal. The area of the bubbles is scaled by the number of respective studies.
1. (Psychophysiol* OR electrocardiogra* OR ecg OR 'heart rate' OR hr OR hrv OR 'respiratory rate' OR 'breath rate' OR 'electrodermal activity' OR eda OR 'skin conductance' OR 'galvanic skin' OR 'tissue blood volume' OR 'blood pressure' OR electrooculogra* OR eog OR eye-track* OR 'pupil diameter' OR 'pupil dilation' OR 'pupil size' OR blink* OR 'dwell time' OR fixation* OR saccade*) AND 2. ('Control centre' OR 'control desk' OR 'control room' OR 'control station' OR 'process control*' OR 'process monitoring' OR 'supervisory control*' OR 'supervisory task*' OR 'system control' OR 'system monitoring' OR 'power plant' OR 'traffic control*' OR ('cabin air management' OR 'command and control' OR 'electricity network' OR 'energy network' OR 'gas refinery' OR 'oil refinery' OR 'rail control*' OR 'unmanned aerial vehicles' OR 'vessel traffic'))

Table 1. Study characteristics, application field, investigated mental state dimensions, applied physiological measures, and research goals of the included studies.

Table 2. Empirical evidence on the validation of physiological measures.
). Event demands = ) a significant effect in the opposite direction to that expected, aggregated across research goals. A single study contributes to the counts for an individual measure once per examined research goal.

Table 3. Empirical evidence from evaluation studies using physiological measures.

Table 4. Empirical evidence on the relationship between physiological measures and human task performance.