Development and Evaluation of Video Recordings for the OLSA Matrix Sentence Test

One of the established multi-lingual methods for testing speech intelligibility is the matrix sentence test (MST). Most versions of this test are designed with audio-only stimuli. Nevertheless, visual cues play an important role in speech intelligibility, mainly by making speech easier to understand through speechreading. In this work we present the creation and evaluation of dubbed videos for the female German MST (OLSA). 28 normal-hearing participants completed test and retest sessions with conditions including audio and visual modalities, speech in quiet and noise, and open- and closed-set response formats. The levels needed to reach 80% sentence intelligibility were measured adaptively for the different conditions. In quiet, the audiovisual benefit compared to audio-only was 7 dB in sound pressure level (SPL). In noise, the audiovisual benefit was 5 dB in signal-to-noise ratio (SNR). Speechreading scores ranged from 0% to 84% speech reception in visual-only sentences, with an average of 50% across participants. This large variability in speechreading ability was reflected in the audiovisual speech reception thresholds (SRTs), which had a larger standard deviation than the audio-only SRTs. Training and learning effects were found for audiovisual sentences: participants improved their SRTs by approximately 3 dB SNR after 5 trials. Participants retained their best scores in a separate retest session and further improved their SRTs by approximately 1.5 dB.


Introduction
Speech audiometry is an essential element in audiology [1][2]. It assesses the ability to understand speech acoustically, which is crucial for human communication. The matrix sentence test (MST) is a well-established speech audiometry method that exists in several languages [3]. MSTs use sentences of 5 words with a "noun – verb – number – adjective – object" structure. There are 10 possible words for each word category, e.g. 10 nouns, 10 verbs, etc., which are combined to create unpredictable but semantically correct sentences.
Although speech can be understood through sound alone, speech perception is a multimodal process. Being able to see the speaker provides additional cues such as lip movements, which make speech much easier to understand [4]. Audiovisual speech perception has been mentioned as a predictor of real-world hearing disability [5], but it is usually not considered in audiometry. It is particularly relevant for evaluating severely impaired listeners, whose speech communication can be supplemented with visual speech. The MST is intended as a speech test for severely impaired listeners; an audiovisual version is therefore an important extension of its applicability. Nevertheless, only audiovisual (or auditory-visual) MSTs in Malay, in New Zealand English and in Dutch have been developed [6][7][8].
The ability to speechread (commonly known as lipreading) plays a key role in audiovisual speech tests. In particular, audiovisual MST results are highly conditioned by speechreading ability. In the Malay MST [6], young normal-hearing participants scored from 25% to 85% speech reception just by speechreading, i.e., in the visual-only condition. Even when using virtual characters in the German MST [10], a cochlear-implanted participant could understand up to 73% of the visual-only speech. Such visual-only scores mean that participants are able to understand speech without any acoustic cues. Therefore a floor effect exists in audiovisual MSTs: even if speech is completely masked by noise and not heard, participants achieve their visual-only speech reception threshold.

Preprint G. Llorach et al.
Recording and validating an MST is quite extensive work: selection of the phonetically balanced speech material, recording of the speech, cutting and processing the sound files, making each word equally intelligible to the others, and evaluation and validation [3]. When recording an audiovisual MST, further considerations need to be taken into account, such as the head movement and facial expression of the speaker [6]. Because audio-only speech tests already exist and have been extensively used, it is worthwhile to reuse the audio material in audiovisual tests to keep validity across studies. One approach, proposed by [9,10], uses virtual characters with lip-synchronization together with existing audio-only speech tests. The advantage of virtual characters is that they can be reused in different configurations with relatively little effort [11]. Nevertheless, the contribution of the lip/facial animation of a virtual character is unknown if it is not validated against or referenced to a real speaker. Another possible approach is to create video recordings dubbed with the existing audio of a speech test. In this case, the video recording usually provides better quality and realism of the visual speech than a virtual character. Nevertheless, asynchronies between the audio and the video have to be kept below 100 ms in order to go unnoticed and not affect speech intelligibility [12][13].
One of the advantages of MSTs is that consecutive trials can be run, as the sentences are unpredictable and there are too many word combinations to be memorized. Nevertheless, participants can learn and improve their results because of the fixed sentence structure and the limited vocabulary of the MST. This training effect has already been shown in audio-only MSTs [14][15] and is particularly relevant in the first trials, where differences of about 1 dB in SRTs are expected. After 2-4 trials there is usually an absolute improvement of 2 dB, and the training effects in the following trials are quite small. Regarding the audiovisual training effect, participants are expected to further improve their SRTs over the trials by becoming familiar with the speaker and the visual material [16].
Another factor to take into account in MSTs is how participants respond to the test. After hearing a sentence of the MST, participants either repeat what they heard (open-set response format) or select the answers from all possible words (closed-set response format). In the open-set format, a researcher has to sit with the participant in order to assess whether the answers are correct. In contrast, in the closed-set format participants can do the test by themselves by selecting the answers from a list of all 50 possible words. In this format participants therefore have expectations about the words to be heard. SRTs have been found to be lower with the closed-set format in some MSTs, e.g. [17][18], although not for German and other languages [3]. Whether such effects appear in audiovisual MSTs has not yet been investigated.
In this work we created an audiovisual version of the female German MST (AV-OLSAf). We recorded videos of a female speaker, dubbed them with the original sentences of that speaker [15][19] and evaluated the material. Our first contribution is the methodology for producing the dubbed videos and selecting the best-synchronized video recordings. Our second contribution is the evaluation of the AV-OLSAf with normal-hearing listeners in different conditions: we show the audiovisual training effects in the open-set and closed-set response formats and the speechreading scores and their effect on the audiovisual SRTs, and we compare audio-only and audiovisual SRTs in noise and in quiet. To conclude, we discuss the implications and recommendations for using the AV-OLSAf.

Recording the Video Material
Although in theory there are 100,000 possible sentences (5 word categories with 10 words per category), the female OLSA uses only 150 predetermined sentences. This relatively small number of sentences permitted us to record videos of the spoken sentences in a single afternoon. We were able to recruit the same speaker who recorded the audio-only version of the German female MST (OLSA) [15][19]. During the recording session, the speaker had to speak the sentences simultaneously while hearing them through an earphone on the right ear. Each sentence was played five times consecutively. Three short "beep" signals were given before each repetition started. The first repetition was used as a reference, i.e. the speaker only listened in order to know which sentence was coming. In the remaining 4 repetitions she had to speak simultaneously while hearing the sentence.
The videos of the female speaker were recorded in the studio of the Media Technology and Production of the CvO Universität Oldenburg. The available studio lights were set up to achieve homogeneous illumination of the face and of the green chroma-key background, as shown in Figure 1. The videos were recorded with a Sony α7S II camera at 50 fps / full HD, with a condenser microphone in front of the speaker at knee height. The speech was recorded in one channel with a 48 kHz sampling rate and a 16 bit linear pulse-code modulation (LPCM) sample format. An image sample of the final video recordings is shown in Figure 1.
As shown in the schema in Figure 1, a computer was used to reproduce the original OLSA sentences, which at the same time was sending a linear time code (LTC) signal to the second audio channel of the camera. This way the recorded speech of the session and the original sentences could be synchronized. The recording session lasted around 2 hours in total.
Figure 1: Recording setup. The red line represents the earphone cable through which the original sentences were reproduced. The black cable represents the LTC signal that was sent from the computer to the camera for synchronization. Bottom: an example frame of the video material.

Selection of the Videos
We manually discarded the videos where the speaker was smiling or showing other non-neutral facial expressions. The recorded speech signals were synchronized to the reproduced original sentences using the LTC signal. When dubbing speech there are inevitable asynchronies: time offsets (words spoken too early or too late) and/or words spoken slower or faster than the original words. As all these asynchronies could occur within a single sentence, we used dynamic time warping (DTW) [20] to find the best match between the recordings and the original sentences. The algorithm compares two temporal signals and provides a warping path. We computed the mel spectrograms of the signals and used them as input for the DTW function. The mel spectrograms were computed using 46 ms frame windows with a frame shift of 23 ms. An example of the mel spectrograms and the corresponding warping path can be seen in Figure 2. Once the warping path was calculated, we used Equations 1 and 2 to compute the asynchrony score:

    d_ij(k) = n_ij(k) - m_ij(k)    (1)

    a_ij = RMS_k( d_ij(k) )    (2)

for i = 1, 2, 3, …, 150 (original sentence number) and j = 1, 2, 3, 4 (recording number per original sentence), where O_i(n) is the mel spectrogram of the original sentence i, R_ij(m) is the mel spectrogram of its corresponding recording (4 recordings per sentence i), n is the frame number of O_i, m is the frame number of R_ij, w_ij(k) = (n_ij(k), m_ij(k)) is the warping path between the mel spectrograms in frames, RMS is the root mean square, and a_ij is the asynchrony score between the i-th original sentence and the j-th recording of that sentence. The asynchrony score can be expressed in seconds instead of frames, as it represents a temporal difference. We checked the sensitivity of this measure by comparing each recording to its corresponding original sentence and to the remaining, unmatching original sentences (Figure 3). For each original sentence we chose the video recording with the smallest asynchrony score.
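The asynchrony score can be illustrated with a self-contained sketch. The study ran DTW on mel spectrograms; the sketch below implements the classic DTW recursion over any frame-wise feature matrices (the feature extraction step is omitted, and the function names are ours, not the study's implementation):

```python
import numpy as np

def dtw_path(X, Y):
    """Classic dynamic time warping; returns the warping path as (i, j) frame pairs."""
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])        # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                                 # backtrack minimal-cost steps
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def asynchrony_score(X, Y, frame_shift_s=0.023):
    """RMS deviation of the warping path from the diagonal, converted to seconds."""
    path = np.array(dtw_path(X, Y), dtype=float)
    dev = path[:, 0] - path[:, 1]                          # frame offset between signals
    return float(np.sqrt(np.mean(dev ** 2)) * frame_shift_s)
```

Two perfectly aligned recordings yield a score of zero; a recording whose content is shifted relative to the original produces a positive score, which is the property used to select the best take.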
Out of the final best selection, we found three outliers with an asynchrony score greater than 60 ms that had to be manually corrected with time offsets. Once corrected, these outliers were shown to 5 normal-hearing participants in comparison with the best-matched sentences. The outliers could not be distinguished from the best-matched sentences and no asynchronies were noticed. We decided that the asynchrony score between the dubbed videos and the corresponding original sentences was small enough (~40 ms) to avoid, or at least minimize, any perceptual asynchrony/dubbing effects [12][13]. Therefore, we proceeded with the evaluation of the material. The final video recordings can be found in [RefToZenodo].

Evaluation of the Audiovisual Material
Participants.
28 normal-hearing participants took part in the evaluation measurements. They were between 20 and 29 years old (mean age 24.9 years); half of them were female. They had normal or corrected-to-normal vision, and their pure tone averages (PTAs) for the better ear were between -5 and 7.5 dB HL (mean -0.31 dB HL). The PTAs were computed using the frequencies 0.5, 1, 2 and 4 kHz. Participants were recruited through the database of Hörzentrum Oldenburg GmbH and were paid an expense allowance. Ethics permission was granted by the ethics committee of the CvO Universität Oldenburg.
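As a minimal illustration (the helper name is ours, not from the study), the four-frequency PTA is simply the mean hearing threshold over 0.5, 1, 2 and 4 kHz:

```python
# Hypothetical helper sketching the pure tone average (PTA):
# the mean hearing threshold over 0.5, 1, 2 and 4 kHz, in dB HL.
def pta(thresholds_db_hl):
    freqs = (500, 1000, 2000, 4000)
    return sum(thresholds_db_hl[f] for f in freqs) / len(freqs)
```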

Setup.
Participants were seated in a chair inside a semi-acoustically treated booth. The evaluation measurements were done with binaural headphones (Sennheiser HDA 200). A 22'' touchscreen display with full HD resolution (ViewSonic TD2220, ViewSonic Corp., Walnut, CA, USA) was placed in front of the participant within arm's reach, at a height of 0.8 meters. The experiment was programmed in MATLAB 2016b. The videos and original sentences were reproduced with VLC 3.03. The acoustic signal was routed with RME TotalMix through an RME Fireface 400 sound card.
The acoustic levels were calibrated using a sound level meter placed at the approximate position of the participant's head. The sound and video reproduction was calibrated for synchronization using an external camera: we reproduced a video with frame numbering together with an LTC signal as the acoustic signal through the experiment's audiovisual reproduction system. The external camera recorded the display screen of the experiment, while the LTC signal was connected directly to the external camera instead of the experiment loudspeaker. Using the recording of the external camera we found a consistent asynchrony of 80 ms, which was corrected by delaying the audio signal in the programming script.
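The 80 ms correction amounts to prepending silence to the audio buffer before playback; a minimal sketch (the helper name and sample-based approach are illustrative, not the study's MATLAB implementation):

```python
import numpy as np

def delay_audio(samples, delay_s=0.080, sr=48000):
    """Prepend silence so the audio starts delay_s later,
    compensating a measured playback asynchrony."""
    pad = np.zeros(int(round(delay_s * sr)), dtype=samples.dtype)
    return np.concatenate([pad, samples])
```

At 48 kHz, an 80 ms delay corresponds to 3840 samples of leading silence.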

Stimuli.
The acoustic stimulus was the female version of the German matrix sentence test (OLSA) [15][19] and the visual stimulus was the best-matched video recording (see Section X). The video recordings had a green background. For the conditions with noise we used a continuous test-specific noise (TSN) based on the female speech material. The presentation level of the noise was kept constant at 65 dB SPL. The speech level of the first sentence was 60 dB SPL for both the speech-in-noise and speech-in-quiet conditions. The adaptive procedure then varied the speech presentation level depending on the responses of the participant.

Conditions.
There were nine conditions in the experiment (see Table 1). Each condition used a list containing 20 sentences. The sentences contained in each list are predefined by the MST.

In total we used 45 different predefined lists. The speech presentation levels were adapted after each sentence in order to reach an individual SRT of 80% word scoring, i.e. 4 out of 5 words correctly recognized per sentence, according to the method of [21]. In the open-set response format, participants had to repeat orally what they understood after each sentence. In the closed-set response format, participants chose the words they understood from an interface displayed on the touch screen after the stimulus presentation. The closed-set interface showed all 50 possible words plus a no-answer option per word category. In the visual-only condition (VONoiseClosed) there was no acoustic speech signal, only test-specific noise at 65 dB SPL; the speech could only be understood through speechreading. The measure for this condition was the percentage of words correct per sentence averaged over the 20 sentences of a list. In all conditions, no feedback was given about correct or wrong responses.
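The sentence-by-sentence level adaptation can be sketched as follows. The study used the procedure of [21]; the update rule and step size below are simplified stand-ins for illustration, not the published parameters:

```python
# Illustrative sketch of an adaptive word-scoring track converging on 80%
# intelligibility (4 of 5 words correct). Not the published procedure of [21].
def next_level(level_db, n_correct, n_words=5, target=0.8, step_db=2.0):
    """Lower the level after a good response, raise it after a poor one."""
    proportion = n_correct / n_words
    return level_db - step_db * (proportion - target) / target  # hypothetical rule

level = 60.0                          # first sentence presented at 60 dB SPL
for n_correct in (2, 3, 5, 4, 4):     # simulated words correct per sentence
    level = next_level(level, n_correct)
```

A response exactly at the 80% target leaves the level unchanged, so the track settles around the level yielding 4 correct words per sentence.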

Training.
In order to assess the training effects of the AV-OLSAf, we added training trials prior to evaluating the nine conditions. Additionally, we tested the participants in two different sessions (test, retest). In the first session the training consisted of 4 audiovisual-in-noise lists (80 sentences). Participants were randomly assigned to do the 4 training lists either in the open-set or the closed-set format (AVNoiseOpen or AVNoiseClosed). 13 participants completed the training in the closed-set format and 15 participants in the open-set format. In the second session the training was a single list in the same condition as in the first session (20 sentences in AVNoiseClosed or AVNoiseOpen). The test and retest sessions were separated by one day to two weeks.

Procedure.
After the training lists, the participants proceeded with the conditions that had the same response format (open-set or closed-set) as assigned in the training. Next, they did the conditions with the opposite response format. The conditions sharing a response format were presented in pseudorandomized order (Figure 4). In the retest session participants did a training list with the same response format as the training trials of the first session. They then continued with the conditions in that same format before doing the ones with the other format, as in the test session.

Ceiling Effects
When analyzing the results, we found some unexpected SNRs and speech presentation levels in quiet: participants reached SNRs below -20 dB and speech presentation levels below 0 dB SPL in the audiovisual conditions. At these levels there is no contribution of acoustic information to speech reception: the speech detection threshold for the female OLSA is around -16.9 dB SNR in audio-only tests with TSN [22], a threshold that can theoretically be lowered by around 3 dB when adding visual speech [23]. Therefore, below these thresholds (-20 dB SNR and 0 dB SPL) participants used only visual speech in this experiment, i.e., they were speechreading. In consequence, the scores below these thresholds no longer represented audiovisual speech perception, but rather visual-only perception. Figure 5 shows that during the adaptive procedure participants could reach levels at which there was no acoustic contribution.
For the analysis of the data we decided to limit the values that were below the acoustic speech detection thresholds, as they were not representative of audiovisual speech reception. In total, 18 out of 366 audiovisual trials (4.9%) were modified by limiting their SRTs to -20 dB SNR for speech in noise and 0 dB SPL for speech in quiet. The affected trials spanned different conditions: out of the 18 trials, 3 were training trials, 5 AVNoiseOpen, 5 AVNoiseClosed, 3 AVQuietOpen, and 2 AVQuietClosed. Out of the 28 participants, 6 were able to go below the speech detection thresholds.
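The limiting applied to the affected trials amounts to a simple clamp at the respective acoustic detection threshold (the helper name is ours):

```python
# Clamp audiovisual SRTs to the acoustic speech detection thresholds:
# -20 dB SNR for speech in noise, 0 dB SPL for speech in quiet.
def limit_srt(srt_db, in_noise):
    floor_db = -20.0 if in_noise else 0.0
    return max(srt_db, floor_db)
```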

Training Effects
There was a general tendency for participants to improve their audiovisual-in-noise SRTs over trials. Figure 6 shows the SRTs during the training trials and the test and retest for the audiovisual-in-noise condition. On average, participants improved their SRTs by -1.6 dB SNR by their third trial. By the fifth trial the total improvement was -2.9 dB SNR. In the second session, participants retained the same SRT scores as in their last trial of the first session. They further improved their SRTs on the trial of the retest session; this improvement was -3.8 dB SNR relative to their first trial in the first session. In Figure 6 it can be seen that there was a consistent difference of ~1.8 dB SNR between the SRT means of the open-set and closed-set trials.

Speechreading and Audiovisual Benefit
Participants had a wide range of speechreading abilities. The individual VONoiseClosed scores ranged from 0 to 84% intelligibility, with an average of 50% and a standard deviation of 21.4%. Figure 7 shows the distribution of the visual-only scores. Regarding the test and retest session scores, there was an average intelligibility improvement of 4.2% in the retest session, although not all participants improved their scores. The intra-individual standard deviation between test and retest scores was 9.4%. Speechreading scores were correlated with the participants' audiovisual benefit (i.e., the SRT difference between the audiovisual and audio-only conditions). This correlation can be seen in Figure 8, where the visual-only scores are plotted against the individual SRTs in different conditions. The Pearson's r correlation scores were -0.66 (p<0.001), -0.69 (p<0.001), -0.65 (p<0.001) and -0.65 (p<0.001) for AVNoiseClosed, AVNoiseOpen, AVQuietClosed and AVQuietOpen respectively. Participants who were good speechreaders gained more from having visual information in the audiovisual trials. Whether participants were trained in the open-set or closed-set format did not make any difference to the audiovisual benefit.
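The reported correlations are plain Pearson's r values; a self-contained sketch, with illustrative data rather than the study's measurements:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient from first principles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Hypothetical data: better speechreaders (higher visual-only score) show a
# larger (more negative) audiovisual SRT benefit, giving a negative r as in the text.
vo_scores = [10, 30, 50, 70, 84]       # visual-only % correct (illustrative)
av_benefit = [-2, -3, -5, -7, -9]      # AV minus AO SRT difference, dB (illustrative)
r = pearson_r(vo_scores, av_benefit)   # strongly negative
```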

Audio-only and Audiovisual SRTs
The average SRT differences between audio-only and audiovisual trials were 5.0 dB SNR and 7.0 dB SPL for speech in noise and in quiet, respectively. The audio-only in-noise conditions were the most consistent across participants (std 1.8 dB). The least consistent conditions across participants were the audiovisual trials (std 3.9 dB) and the audio-only speech-in-quiet trials (std 2.8 dB). The PTA of the listeners was not significantly correlated with the audio-only in-quiet scores (Pearson's r = 0.15, p=0.11). When looking at the intra-subject differences (test minus retest SRTs), we found the audio-only SRTs to be more consistent from test to retest, while the audiovisual SRTs still improved on average by around 1 dB on the retest. The average intra-subject differences and standard deviations were 0.82 dB (std 2.02 dB), 0.42 dB (std 2.16 dB), 1.39 dB (std 2.34 dB) and 1.79 dB (std 4.11 dB) for audio-only in noise, audio-only in quiet, audiovisual in noise and audiovisual in quiet, respectively. The response format (open or closed) did not have an effect on the scores, with the exception of the audiovisual-in-noise trials, in which the closed-set response format reached slightly lower SNRs (~0.5 dB). Furthermore, the training response format affected the AVNoiseOpen results: participants who were trained with the closed-set format, and thus knew the complete set of words, reached an average SRT of -13.8 dB SNR (std 4.0 dB), whereas those trained with the open-set format only reached an average SRT of -12.1 dB SNR (std 2.7 dB). Mean SRTs and standard deviations of the test and retest trials are shown in Table 2.

Validity of the Video Material
Overall, the audiovisual speech intelligibility scores were high enough to suggest that there were no asynchronies in the video material [12][13]. Nevertheless, sentences and words should be checked individually for abnormal intelligibility scores that could be due to asynchronies. Such asynchronies could then be corrected by editing the video material. Participants were not specifically asked to detect asynchronies in the audiovisual material during the evaluation, but as far as the authors know none reported any temporal artifacts.
The audio-only and audiovisual scores found were similar to those expected from the literature. The reference SRT value for the female OLSA is -9.4 dB SNR at 50% intelligibility [15], and we found an SRT of -8 dB SNR at 80% intelligibility. At higher SNRs speech is easier to understand, so this 1.4 dB difference reflects the difference in target intelligibility (from 50% to 80%). In [8] there was a difference of 3 dB between audio-only and audiovisual scores, whereas we found a difference of more than 5 dB in the equivalent conditions. This difference could be due to the specific speaker, as some are easier to speechread than others [24], or to language differences [3].
Speechreading scores were in concordance with the literature. In the Malay MST [6], participants reached scores from 25% to 84% word scoring with a mean of 57.1% when speechreading in quiet with the closed-set response format. Similarly, in the Dutch MST [8] the scores ranged from ~15% to 100% word scoring in the same condition. Therefore, using our dubbed video recordings provides the same or similar results as video recordings with their corresponding audio. This is particularly relevant for applying our dubbed-visual-speech method to established audio-only speech tests.

Advantages of Dubbing
As mentioned in the introduction, one of the advantages of using existing audio-only material is that the validity of the acoustic speech is kept. For example, in [8] a large variability in intelligibility across words was found, probably because the word acoustic levels were not balanced and optimized as is usually done in MSTs. Nevertheless, this does not mean that a non-optimized MST is not usable: MSTs without level adjustments [25] are used in research and can evaluate speech recognition thresholds with almost the same precision.
Another advantage is the reduction of effort and complexity in the recording procedure. When starting from a limited number of sentences (150), the recording procedure is simpler and faster. In [6] all 100,000 possible sentences were created by re-mixing 100 recorded sentences, as there was no final selection of sentences. During the recording session the authors had to ensure that the speaker's head was in the same physical position so that the videos could be cut and blended without artifacts; for this purpose they had to fabricate a head-resting apparatus. The material required an additional evaluation step to validate the re-mixed recordings, resulting in 600 final sentences. Had they needed to record only 150 final sentences, as in this study and in [8], the recording procedure would have been shorter and simpler. Virtual characters would be another possible solution for creating visual speech, offering more flexibility and control. Ideally, a virtual character's lip-syncing should achieve the same intelligibility scores as videos of real speakers. In [10] the German MST with virtual characters was used: cochlear-implanted (CI) and normal-hearing (NH) participants achieved 37.7% and 12.4% word scoring in the visual-only condition, respectively, which is below the average scores expected with real speakers. In [9] an SRT improvement of only 1.4 dB SNR was found with virtual characters, while we found a 5 dB SNR improvement. Nevertheless, the speech material in the latter study was different and thus cannot be directly compared.
We would like to introduce audiovisual MSTs as tools to evaluate the lip-syncing animations of virtual characters. Most current research in lip animation and visual speech does not consider human-computer communication and speech understanding in its evaluation procedures [26][27]. For this purpose, our videos are published in an open repository [REFtoZenodo] in the hope that developers of lip-syncing tools will also consider speech understanding as an evaluation measure.

Speechreading
The ceiling and floor effects found in the audiovisual MST arose because the visual speech was easy to speechread. One reason could be that the limited set of words is easy to learn, to differentiate visually and to speechread. For sentences without previous knowledge of context, one would expect lower speechreading scores, around 30% [31][38]. Nevertheless, it can be argued that having some content expectations is probably closer to a real-life conversation, where the conversational context is usually known.
Another possible factor is that the female speaker was easy to speechread, as there can be differences between speakers. For example in [24] young female speakers were judged to be easy to speechread. We did not make a selection of speakers as we wanted to have the same person that recorded the audio-only MST. Female speakers have been recommended as a compromise between the voice of an adult male and a child [28], thus it was a good starting point. Selecting speakers that are more difficult to speechread would probably reduce the ceiling and floor effects.
An interesting alternative to audiovisual MSTs would be to develop a viseme-balanced MST. The audio-only MST is designed to be phonetically balanced, but this does not mean that the visual speech is balanced, as each phoneme does not necessarily correspond to a viseme [29]. In [30] some effects of the visual cues on the word intelligibility and word error for the AV-OLSAf were found, showing that acoustic and visual speech provide different information. Therefore it could be that the visual speech found in the current MST sentences is not representative of the language tested. Language-specific viseme vocabularies [31] should be developed for this purpose.
That the audiovisual trials were correlated with the speechreading scores was expected [32][33][8]. The better the participants could speechread, the less acoustic information they needed to understand speech. This correlation was present in both the noise and quiet conditions; the audiovisual benefit was thus robust across acoustic conditions.

Training Effects
The training effects found in the audiovisual MST were bigger than those found in audio-only MSTs. An improvement of about 2 dB SNR is usually expected after 4-5 lists in audio-only MSTs at 50% speech reception [15]. We found a ~3 dB SNR improvement at 80% speech reception; the additional dB probably arose from learning to speechread the material and becoming familiar with the speaker [16]. Nevertheless, such a training effect was not reported for the audiovisual Dutch MST [8] after a familiarization phase with the complete set of words and a training list of 10 audiovisual sentences.
The audiovisual training worked as training for the audio-only trials too. In our experiment, participants reached SRTs of -8.04 dB SNR for 80% speech reception. Without a training effect, their SNRs would have been higher: for example, in [15] participants scored a 50% SRT on their first list at an SNR of -6.5 dB.

Individual Differences
The large variability found in audiovisual trials can be explained by the individual speechreading abilities and by the intra-subject differences. Speechreading scores were highly individual and the participants performed differently in the test and retest trials (standard deviation of 2 dB). Therefore, when testing a listener in audiovisual conditions, a standard deviation of 2 dB between trials can be expected. In the case of audio-only speech in noise SRTs, we found little variability among individuals, which was expected as all participants were young and did not have any hearing disability [35].
The variability in audio-only in quiet trials was not explained by the PTAs. Hearing thresholds and noise-induced hearing loss are usually correlated to speech in quiet scores: the worse the hearing levels, the worse the speech intelligibility in quiet [34]. Nevertheless, we did not find this correlation in our study, probably because the individual PTAs were similar and we did not include hearing-impaired participants.
Audiovisual MSTs are particularly relevant for testing severely to profound hearing-impaired listeners in the clinic. These listeners cannot perform audio-only intelligibility tests and therefore the audiovisual MST would be useful to investigate whether a hearing aid or cochlear implant provision improves their audiovisual speech comprehension. Further research should evaluate the AV-OLSAf with hearing-impaired and elderly participants, as some effects are expected: hearing-impaired listeners tend to be better speechreaders [36], and the ability to speechread decreases with age [37].

Conclusions

• Dubbed videos combined with an established audio-only MST reached the same performance as direct audiovisual recordings would have. The method presented here keeps the validity of the original audio material while introducing concordant visual speech.
• The audiovisual MST suffers from both ceiling and floor effects, which are closely related to the speechreading abilities of the participant. These effects should be considered when designing experiments on audiovisual perception. High target intelligibility levels, such as an 80% SRT, are recommended over a 50% SRT in adaptive procedures.
• Audiovisual stimuli gave an SRT benefit of 5 dB SNR in test-specific noise and 7 dB SPL in quiet in comparison to audio-only stimuli for young normal-hearing participants. Reference values for 80% SRT found in this study were -13.2 dB SNR for audiovisual speech in noise and 10.7 dB SPL for audiovisual speech in quiet.
• At least three to four training lists should be done to reduce the training effects. Participants may still improve their audiovisual SRTs even in the retest session, so training effects may persist after a certain number of training trials. More than one trial should be done to properly evaluate an audiovisual condition.
• There was greater variability in audiovisual SRTs than in audio-only SRTs. This was due to the large variability in speechreading abilities, which were correlated with the audiovisual SRTs.