A virtual speaker in noisy classroom conditions: supporting or disrupting children's listening comprehension?

Abstract
Aim: Seeing a speaker's face facilitates speech recognition, particularly under noisy conditions. Evidence for how it might affect comprehension of the content of the speech is sparser. We investigated how children's listening comprehension is affected by multi-talker babble noise, with or without presentation of a digitally animated virtual speaker, and whether successful comprehension is related to performance on a test of executive functioning. Materials and Methods: We performed a mixed-design experiment with 55 participants (34 female; 8- to 9-year-olds), recruited from Swedish elementary schools. The children were presented with four different narratives, each in one of four conditions: audio-only presentation in a quiet setting, audio-only presentation in a noisy setting, audio-visual presentation in a quiet setting, and audio-visual presentation in a noisy setting. After each narrative, the children answered questions on the content and rated their perceived listening effort. Finally, they performed a test of executive functioning. Results: We found significantly fewer correct answers to explicit content questions after listening in noise. This negative effect was mitigated only to a marginally significant degree by audio-visual presentation. Strong executive functioning predicted more correct answers in quiet settings only. Conclusions: Altogether, our results are inconclusive regarding how seeing a virtual speaker affects listening comprehension. We discuss how methodological adjustments, including modifications to our virtual speaker, can be used to discriminate between possible explanations for our results and contribute to understanding the listening conditions children face in a typical classroom.


Introduction
Background
Oral instruction is a major part of how most children receive their basic education at school. For oral instruction to be effective, children have to be able to comprehend what the teacher is saying. Classrooms, however, are often challenging listening environments: children have to listen and comprehend surrounded by competing speech that might mask or distract from their teacher's speech [1,2].
Noise in classrooms can have serious consequences, putting large groups of children at a disadvantage. For example, academic performance of children aged 7-11 in the London area was negatively correlated with the noise levels in their schools [3]. In our study, the focus is on listening (passage) comprehension in noise, and how it is affected by seeing the face of a speaker.
It is well established that speech recognition, a prerequisite for comprehension, is impaired by noise. Speech recognition is typically assessed as the ability to repeat heard words [4] or sentences [5]. Importantly for our study, children's speech recognition has been found to be more sensitive to noise than that of adults [6]. Speech recognition does not require any semantic processing, but it can be mediated by semantic context [4].
Speech recognition can also be aided by visual context. School children can typically see their teacher speaking, which enables audiovisual integration that can affect speech processing in several ways. Adults' speech recognition in noise has been found to improve when a speaker's face is visible [7]. Seeing lip movements has a direct effect on speech recognition, one that is not always beneficial for disambiguating speech: the so-called McGurk effect refers to how seeing synchronous lip movements affects perception of spoken syllables [8]. Besides lip reading, head movements coordinated with speech prosody ("visual prosody") have been shown to facilitate recognition of syllables in spoken Japanese [9].
In the current study, we examine effects on comprehension of speech matched with the movements of a digitally rendered face. Technological advancements have enabled computers both to produce speech and to present a speaker visually. The terms Virtual Humans, Virtual Agents, and Embodied Conversational Agents [10] have been used to describe such implementations, and their potential value in educational or instructional applications has been pointed out [11], as well as their importance as research tools [12,13]. Adults' speech recognition has been found to be facilitated by digitally rendered faces [14][15][16][17]. However, the facilitation is generally weaker than that from a video recording of a speaker. Since the educational scenario that we aimed to create in the current study involves no interaction but fixed roles of teacher (speaker) and children (listeners), we use the term virtual speaker.
Recognition is, as mentioned, one important prerequisite for comprehending speech. Comprehension, however, also requires listeners to extract and infer meaning from speech, as well as to relate it to previous knowledge and encode it for later retrieval. Even at noise levels where speech is accurately recognized, comprehension can be impeded. Studies have found that adults were less able to answer questions about content included in lectures they had heard in noise [18], and that the effect is stronger for children [19]. Similarly, another study assessed listening comprehension as the ability to follow oral instructions and found children (but not adults) to be affected by the presence of competing speech [20]. Yet another study tested children listening in noisy classroom conditions [21] on five different aspects of listening comprehension included in a standardized listening comprehension test [22]: identifying the main idea, defining vocabulary, recalling details, inferring information, and identifying the most pertinent information. Most children performed below the standardized 95% confidence interval for their age on the three latter aspects [21].
In a review paper from 2012, Mattys et al. map how factors defining the speech signal, the listening environment, and listener capabilities affect comprehension differently, and discuss links to general cognitive capacities [23]. It has been demonstrated that strong working memory capacity can alleviate the negative effect of noise on reading comprehension [24], and that comprehension of both typical and hoarse voices was positively correlated with performance on the "Elithorn's Mazes" test (probing executive functioning, including attentional and inhibitory control) [25]. Increased cognitive load also requires listeners to expend more effort, possibly inducing fatigue and stress. Thus, the presence of competing speech has been linked not only to decreased performance but also to physiological indices of effort [26] and to self-reported perceived listening effort and "frustration" (interpreted as an indicator of stress) [27].
In this context, effective audiovisual integration might challenge a child's cognitive capacities, and make the visual channel a distraction rather than a support. One study found executive functioning in adults to strongly predict "visual enhancement" of the ability to recognize spoken sentences in competing speech [28]. Moreover, at least one study has found that visual presentation of the speaker increased load on working memory [29]. Another study investigated listening effort, and found no general effect of visual presentation, but that participants with strong lipreading ability and working memory capacity were more likely to benefit from visual cues, when listening in multitalker babble noise. The authors interpret their results as indicating that audio-visual integration depends on more general cognitive resources [30].
The study presented in this paper investigates how children's listening comprehension is affected by multi-talker babble noise with or without visual presentation of a speaker. For our implementation, we used a virtual speaker, speaking with a hoarse voice. Attempting to compensate for a noisy classroom environment, teachers often strain their voices to a point where the voice becomes hoarse. Speakers generally tend to speak about 10 dB louder than the background and in a higher pitch [31]. A survey of 22 randomly selected schools throughout the south of Sweden found that 37% of teachers suffered at least occasionally from voice problems [32]. The rationale behind the use of a degraded voice was thus to preserve ecological validity as well as to make our results comparable to parallel studies using similar methods [33]. The aim of the study was not to evaluate effects of voice quality. It is, however, worth noting that degraded voice quality can itself (in silent background) increase listening load [34] and impede children's comprehension [35].

Study objective
The main purpose of the current study was to investigate if a virtual speaker can serve as a visual support for listening comprehension in adverse conditions that approximate a classroom environment.

Research question 1
Will children's comprehension and perceived listening effort suffer when listening to a narrative read in multi-talker babble noise compared to in a quiet setting?

Research question 2
Is children's comprehension and perceived listening effort mitigated or accentuated when they can see a virtual speaker?

Research question 3
What role does executive functioning play for children's comprehension and perceived listening effort?

Listening comprehension
To investigate our research questions regarding effects of noise, visual presentation of the speaker and executive functioning on listening comprehension we used material from the listening comprehension part of the Clinical Evaluation of Language Fundamentals (CELF4), a commonly used assessment tool for language skill [36]. The listening comprehension part of CELF4 involves listening to short narrative texts. Each narrative is accompanied by five questions. The questions are divided into three "content" questions (about details or facts explicitly mentioned: "What did the students receive more than pizza and soft drinks?"), one "inference" question (requiring extrapolation based on the content, "Why do you think Pricken ran in the opposite direction?") and one "summary" question (a simple question about the overall theme of the narrative, "How do you think it ended?"). Predefined guidelines for scoring (defining correct versus incorrect answers) are provided for each question. Due to the non-specific answer definitions, the summary questions were not included in our analysis, except as a basis for excluding outliers.
For the study, we chose three narratives intended for the age group 9-10 years and one "exercise" narrative of comparable difficulty. The narratives were presented in one of four conditions, defined by two 2-level variables: mode of presentation (audio-only (A)/audio-visual (AV)) and auditory setting (quiet (Q)/multi-talker babble noise (N)). The narrator's voice used in all conditions was recorded with a head-mounted microphone (Lectret HE-747) and sampled at 16 bits with a 44.1 kHz sampling rate. Prior to all recordings used, the narrator had undergone a vocal loading procedure to induce hoarseness [37]. Speech rate was similar across the narratives (around 165 words or 275 syllables per minute). The levels of the voice recordings were normalized to equal root-mean-square (RMS) levels using Adobe Audition CS6. The background noise used in the A-N and AV-N conditions was constructed from recordings of four children reading separate stories. These recordings were normalized and combined, and the resulting multi-talker babble noise track was added to the voice track at a signal-to-noise ratio (SNR) of +10 dB. The purpose of the multi-talker babble noise track was to approximate classroom conditions, with several competing simultaneous speakers.
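The RMS normalization and noise mixing described above can be sketched in a few lines of Python. The sketch below is illustrative only (the actual processing was done in Adobe Audition CS6), and the function names are our own:

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, babble, snr_db):
    """Scale the babble track so that the speech track exceeds it by
    snr_db dB (in RMS terms), then mix the two tracks sample by sample."""
    gain = rms(speech) / (rms(babble) * 10 ** (snr_db / 20))
    return [s + gain * b for s, b in zip(speech, babble)]
```

At the +10 dB SNR used in the N conditions, the babble track is attenuated until its RMS level lies 10 dB below that of the normalized speech signal, i.e. the speech RMS is 10^(10/20) ≈ 3.16 times the babble RMS.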

Virtual speaker
In order to study the effect of visual presentation of the speaker, we created a virtual speaker based on facial and postural animation captured at the same time as the voice recordings using an ASUS Xtion Pro Live 3D sensor. The 3D sensor captures depth maps at 640 × 480 pixel resolution via an active infrared sensor, as well as 1280 × 1024 pixel (RGB) video at 30 frames per second. Facial animation, orientation of head and torso, and gaze direction were extracted in Faceshift (software specialized for facial motion capture). The captured movements were then implemented on a digital character generated with Autodesk Character Generator, and video frames (1024 × 768 pixels) of a frontal view of the head and upper torso of the virtual speaker were rendered in Autodesk Maya 2014 (Figure 1). Finally, the video and audio tracks were combined into video files (AVI multimedia container format) using Avidemux 2.5, with Xvid video compression and uncompressed audio. The fidelity of the lip movements in the final videos was deemed sufficient by an expert lip reader (post-experiment evaluation), though a minor issue with some pronunciations of /f/ was noted.

Perceived listening effort
To probe perceived listening effort, we used a short self-report questionnaire. The first question (Q1) was "How did listening to this text make you feel?" ("Hur kände du dig när du lyssnade på den här texten?"). The second question (Q2) was "Did you think the task was easy or difficult?" ("Tyckte du att den här uppgiften var lätt eller svår?"). The questions were formulated as simply as possible in order to be administrable to children. The children responded to the two questions using Visual Analogue Scales (VAS) [38]. Endpoints were represented by negative and positive emoticons, and responses were sampled by measuring (in millimeters) where on the lines the children had made a mark. Obtained values thus ranged from 0 (most negative, i.e. most effortful) to 100 (most positive, i.e. least effortful).

Executive functioning
We also included the Elithorn's Mazes (EM) test of executive functioning (part of the Wechsler Intelligence Scale for Children [39]). The test involves completing mazes of increasing difficulty, by connecting the correct number of dots without lifting the pen from the paper. The possible scores (including time bonus) range from 0 to 56 (best). The reason to schedule this test at the end of the experimental procedure was to avoid fatigue during the main listening comprehension part.

Participants
We conducted an experiment with children recruited from elementary schools in the Scania region in the south of Sweden. Out of the 61 participants, six were excluded (three failed to pass the hearing screening (described below), one did not complete the EM test, and two scored abnormally low on the listening comprehension test and also failed on more than half of the "summary questions"). The median age of the 55 included participants (34 female, 21 male) was 104 months, ranging between 100 and 111 months.

Procedure
The procedure included a short hearing screening using a Grason-Stadler GSI 66 audiometer and TDH 39 headphones. Hearing levels poorer than 20 dB Hearing Level on any of the frequencies 0.5 kHz, 1 kHz, 2 kHz, 3 kHz, 4 kHz, 6 kHz, or 8 kHz resulted in exclusion. The screening was followed by the actual listening comprehension test. The two variables "mode of presentation" and "auditory setting" were fully crossed; each participant received one narrative in each of the four conditions: A-Q (audio-only, quiet), A-N (audio-only, noise), AV-Q (audio-visual, quiet), and AV-N (audio-visual, noise). The occurrences of the different narratives in the different conditions, and their order, were systematically varied using Latin square arrays. All conditions were presented on a laptop with circumaural sound-attenuating Sennheiser HDA 200 earphones. The speech signal was presented at 65 dB Sound Pressure Level (SPL). In noisy auditory settings, the babble noise was presented at an SNR of +10 dB. The equipment was calibrated according to IEC 60318-2 and ISO 389-8 with a Brüel & Kjær 2209 audiometer and a 4134 microphone in a 4153 coupler (IEC: 1998, ISO: 2004). A 1 kHz tone with the same average RMS as the speech signal was used to verify the SPL for speech and background noise.
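The Latin square counterbalancing can be illustrated with a simple cyclic construction. The sketch below is illustrative only (the actual assignment arrays are not reproduced here); rows stand for successive participants and columns for the four narratives:

```python
def latin_square(items):
    """Cyclic Latin square: row r is `items` rotated r steps, so each
    item occurs exactly once per row and once per column."""
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]

# The four conditions; cycling rows across participants means each
# narrative occurs in each condition equally often within every
# block of four participants.
conditions = ["A-Q", "A-N", "AV-Q", "AV-N"]
for row in latin_square(conditions):
    print(row)
```

With exclusions, blocks of four are broken, which is why (as noted under Statistical analysis) the final design was not perfectly balanced.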
Answers to the questions related to the listening comprehension narratives were given verbally and transcribed by an experimenter directly following the presentation of each narrative. The two additional questions probing perceived listening effort were asked after each narrative, following the CELF questions. After the completion of the four narratives that constituted the listening comprehension part (including the two questions targeting perceived listening effort), participants proceeded to the EM test.
In summary, for each participant we measured listening comprehension (of explicit or inferred content) of narrative texts read in noisy (N) or quiet (Q) settings, with (AV) or without (A) visual presentation of a virtual speaker. Next, as secondary outcome variables, each participant answered the two self-assessment questions targeting perceived listening effort. Finally, executive functioning was measured by means of Elithorn's Mazes (EM), constituting a continuous between-subject variable.

Ethical considerations
The experimental procedure was designed to not be too taxing on the participating children, who were also informed that they were free to leave at any point. Sound levels were calibrated below any potential hazardous level. Children that were excluded from the analysis due to not passing the hearing screening were still allowed to perform the experiment in order to avoid any feeling of being left out. Informed consent from the children's legal guardians was obtained via forms distributed and collected well before the actual data collection. The study adheres to the Helsinki ethical guidelines. Identities of the participating children were anonymized directly following the data collection; identification keys and original data collection forms were stored separately in locked cabinets.

Statistical analysis
We analyzed the participants' scores on the different categories (content and inference) of the listening comprehension questions and the two self-assessments of perceived listening effort as mixed-design full-factorial models with three factors. Specifically, our models were built with auditory setting (Q as reference level) and mode of presentation (A as reference level) as within-subject factors and EM score (centered on the global mean) as a between-subject factor (see Equation 1). All analyses were performed in R [40]. Regression analyses were performed using the lmerTest package [41], and the coefficients of determination of the mixed models (conditional R²) [42] were calculated using the MuMIn package [43]. Since some children had to be excluded, the combinations of narratives and conditions were not entirely balanced (with regard to our use of a Latin square design), which prevented us from ruling out possible effects of specific narratives or order. However, in the official instructions, the listening comprehension scores of the different narratives are summed uniformly towards the total CELF score. We therefore proceeded with our planned mixed-design analyses.
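In outline, the full-factorial model can be written as follows (a sketch consistent with the factors above, not a verbatim reproduction of Equation 1; N and AV are indicator variables for noise and audio-visual presentation, EM is the mean-centered maze score, u_i a by-participant random intercept, and ε_ij the residual):

score_ij = b0 + b1·N_ij + b2·AV_ij + b3·EM_i + b4·(N×AV)_ij + b5·(N×EM)_ij + b6·(AV×EM)_ij + b7·(N×AV×EM)_ij + u_i + ε_ij

For the inference question, the same linear predictor is applied through a logistic link to the probability of a correct answer.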

Listening comprehension: the content questions
The proportions of correct answers to the content questions were analyzed as a mixed linear regression model (see Equation 1, conditional R² = 0.226). The analysis revealed a strongly significant negative effect of noise (b = −0.265, t = −4.96, p < .001), but no significant effect of audio-visual presentation (b = −0.036, t = −0.674, p = .50). There was also a marginally significant interaction between mode of presentation and auditory setting (b = 0.133, t = 1.76, p = .080), indicating that presentation of the virtual speaker somewhat reduced the negative effect of noise (Figure 2). Furthermore, our analysis found a positive effect of EM score in quiet settings (b = 0.013, t = 2.20, p = .029), together with an opposite (negative) interaction effect of EM in noise (b = −0.017, t = −2.11, p = .037). This indicates that the observed benefit of strong executive functioning with regard to listening comprehension (content) in quiet settings was not present in noise (Figure 3). No significant effect was found for the three-way interaction between EM, auditory setting, and mode of presentation.

Listening comprehension: the inference question
Since the four experimental conditions each yielded only one correct/incorrect data point, answers to the inference question were analyzed as a logistic regression model (see Equation 1, conditional R² = 0.059). We found no significant effect of auditory setting (b = 0.299, Z = 0.627, p = .53), mode of presentation (b = 0.121, Z = 0.258, p = .80), or their interaction (b = 0.035, Z = −0.520, p = .60). We did, however, find a positive effect of EM score (b = 0.112, Z = 2.08, p = .038) and an opposite (negative) interaction effect of EM with mode of presentation (b = 0.169, Z = −2.36, p = .018), indicating a benefit of strong executive functioning with regard to listening comprehension (inference) only when listening without visual presentation of the virtual speaker (Figure 4).
No significant effect was found for the three-way interaction between EM, auditory setting, and mode of presentation.

Perceived listening effort
The VAS ratings related to perceived listening effort were analyzed as mixed linear regression models. The ratings of Q1 ("How did listening to this text make you feel?", conditional R² = 0.572) did not yield any significant effects. All conditions produced ratings around 80% towards the positive end. For the ratings of Q2 ("Did you think the task was easy or difficult?", conditional R² = 0.385), we found a strongly significant negative effect of auditory setting (b = −17.77, t = −3.

Summary of results
The presence of multi-talker babble noise when listening to the narratives impaired the participating children's performance on the related listening comprehension content questions, and made them perceive the task as more difficult. The impaired performance on content questions after listening in noise was somewhat mitigated by audio-visual presentation. Furthermore, children who scored high on EM performed better on the listening comprehension content questions, but only in the absence of noise, and better on the listening comprehension inference question, but only with audio-only presentation.

Discussion
We investigated how children's listening comprehension and perceived listening effort are affected by background babble noise (research question 1), with or without presentation of an animated virtual speaker (research question 2), and how comprehension under these conditions might be related to executive functioning (research question 3). Our results were partly inconclusive.

Research question 1
Listening comprehension was impaired by background noise, as was to be expected from studies finding noise to adversely influence speech recognition [6] and comprehension [18][19][20].
Note that the SNR used (+10 dB) was generous compared to many previous studies [18,19,21], but we still observed a clear effect on comprehension compared to no noise. The degraded quality of the hoarse voice used in all conditions probably added to the difficulties of listening in noise, if we extrapolate from the results of some of our earlier studies [1].

Research question 2
The effect of noise tended to be reduced by the visual presentation of the virtual speaker, but our results only showed a marginally significant effect. Visual support only compensated for about half of the reduction in performance on the content questions when listening in noise. Previous findings indicating that seeing a speaker's face (virtual or real) facilitates speech recognition [14][15][16][17] support the prediction of a general positive effect of audio-visual presentation on comprehension, which obviously in part depends on recognizing the verbal content of the speech. Visual presentation has also previously been found to facilitate comprehension of semantically challenging sentences [44]. The absence in our results of any effect of mode of presentation in quiet settings could indicate a ceiling effect; without distracting noise, recognition of spoken content was not a bottleneck for successful comprehension. The inconclusive effect of audio-visual presentation in background noise could have different explanations. Apart from recognition, comprehension also depends on several parallel processes and, although visual sensory information is integrated already at an early stage of speech perception, studies of audio-visual integration indicate that it is sensitive to perceptual and cognitive context. The McGurk effect has, for example, been found to be more dominant in a noisy environment [45], but less prevalent as cognitive load increases [46] or in situations where visual attention is divided [47]. Previous studies have indicated a complex role of visual information in speech processing [30], and that audio-visual integration can result in an increased load on resources, manifested by hindered performance on a parallel memory task [29]. Our participants' age (8- to 9-year-olds) might also have affected the result. Previous research shows age-dependent effects of audio-visual speech.
One study found 5- to 9-year-olds to be less sensitive to audio-visual verbal distractors compared to both 4-year-olds and 10- to 14-year-olds while performing a picture naming task [48]. In another study, 10- to 11-year-olds reported it to be more difficult to hear their teacher's speech when they could not see her/his face [49]; this in contrast to 6- to 7-year-olds, who did not report any such difficulties. These findings indicate that we possibly would have found a stronger effect of audio-visual presentation if our participants had been a few years older.
Qualitative aspects of the presentation of the virtual speaker possibly prevented benefits that could have been observed with visual presentation of a real speaker. As mentioned, our lip-reading expert did note minor issues with some pronunciations of /f/. The appearance of the virtual speaker might also have distracted the participating children. The term "uncanny valley" is used to discuss how virtual characters that approach (but do not quite replicate) human visual appearance may be perceived as unpleasant [50]. The virtual speaker in the current study was therefore designed to not appear too photorealistic, to decrease the risk of "falling into" the uncanny valley. It is also worth noting that the recorded speaker was reading the texts from a piece of paper placed next to the camera rather than addressing a listener. This could also add to an unnatural appearance, as much of the non-verbal behavior typically present in face-to-face communication (such as hand gestures) was not produced. The responses to the first question addressing perceived listening effort (Q1) did, however, not indicate any adverse reactions to seeing the virtual speaker. We are currently working on a follow-up study comparing the virtual speaker with the real video of the actual speaker recorded in parallel to the motion capture recordings. Preliminary results indicate that the benefit of seeing the virtual speaker when listening in noise is at least at the same level as that of seeing the actual speaker [51].
On the other hand, if the audio-visual presentation of the speaker induces an increased "perceptual load", analogous to discrimination based on a conjunction of features, perceptual load theory would predict that the visual information should facilitate semantic processing [52]. With fewer resources available to process the noise, it would thus be "filtered" out at an earlier perceptual stage. It is, however, unclear whether perceptual load has such an effect across modalities [53].

Research question 3
The results of the content questions indicate that a strong capacity for executive functioning (as measured by EM) did facilitate comprehension in quiet auditory settings. The material was presented with a (provoked) hoarse voice in all conditions, and our finding that processing of hoarse speech can be facilitated by strong executive functioning is in line with previous results [54]. In the presence of background noise, however, no such benefit was observed. We have no explanation for this observation, and refrain from speculation.

Limitations
The inference question results showed no significant effects of noise or visual presentation of the virtual speaker, but a benefit from strong executive functioning in the absence of the virtual speaker. The absence of any effects of noise for the inference question may be due to the fact that the inference questions were too generally stated (in the Swedish translation of CELF4), making it possible for the children to infer the answers without extracting the information directly from the narratives. It is also worth noting that the analysis was based on only one question per child and condition. A previous study had indeed observed impaired performance (compared to standardized confidence intervals) on questions requiring participants to "infer information" [21].
The perceptual and cognitive load associated with other aspects of listening comprehension, such as identifying the most pertinent information [21] or following verbal instructions [20], may differ from that of the questions related to narrative content tested in the current study. A more differentiated operationalization, accounting for the different aspects and sub-processes involved, is needed to specify how visual information can support listening comprehension under different conditions.
Another limitation of the study was the low number of participants in combination with rather few data points per participant and condition, especially so for the inference questions. Furthermore, the combination of narratives, order, and conditions was, as mentioned, not perfectly balanced. All conditions were presented with a hoarse voice (to establish higher ecological validity), and we cannot assume that our results also hold for listening to typical (clear) voices. Extending the experimental design with control conditions using a clear voice could have made our results more revealing. However, the number of available narratives for the age group would then have forced us to abandon the within-subject design, risking loss of statistical power.

Future work
Our future plans include, besides the aforementioned follow-up study comparing effects of real and virtual speakers, further investigating how audiovisual information can support listening in situations with surrounding noise. Factors defining the appearance and movement of virtual speakers can be systematically varied, while still being presented in rich, multimodal contexts. This can allow us to investigate the contribution of different aspects such as lip movements, "visual prosody", or distracting visual features. These are all factors that would be difficult to control systematically in natural face-to-face settings or using video stimuli. We also want to develop an applicable objective measure of cognitive load by introducing a secondary task compatible with the virtual speaker paradigm. Performance on parallel tasks performed while listening is indicative of remaining available cognitive resources. One example of such a parallel task is to connect items by drawing lines on paper [34], something that requires vision. The challenge would be to find a task that is possible to perform while also looking at a speaker.

Conclusions
Our results indicate that children listening in noise to some extent benefit from seeing a speaker's face, but the results are inconclusive. We consider virtual speakers as promising research instruments to help us disambiguate between possible explanations of our results and to contribute to the understanding of how audio-visual integration works in adverse and ecologically valid listening environments.