The effect of stimulus duration on preferences for gain adjustments when listening to speech

Objectives: In the personalisation of hearing-aid fittings, gain is often adjusted to suit patient preferences using live speech. When using brief sentences as stimuli, the minimum gain adjustments necessary to elicit consistent preferences (“preference thresholds”) were previously found to be much greater than typical adjustments in current practice. The current study examined the role of duration on preference thresholds.
Design: Participants heard 2, 4 and 6-s segments of a continuous monologue presented successively in pairs. The first segment of each pair was presented at each individual’s real-ear or prescribed gain. The second segment was presented with a ±0–12 dB gain adjustment in one of three frequency bands. Participants judged whether the second was “better”, “worse” or “no different” from the first.
Study sample: Twenty-nine adults, all with hearing-aid experience.
Results: The minimum gain adjustments needed to elicit “better” or “worse” judgments decreased with increasing duration for most adjustments. Inter-participant agreement and intra-participant reliability increased with increasing duration up to 4 s, then remained stable.
Conclusions: Providing longer stimuli improves the likelihood of patients providing reliable judgments of hearing-aid gain adjustments, but the effect is limited, and alternative fitting methods may be more viable for effective hearing-aid personalisation.


Introduction
In the treatment of hearing loss, clinicians fit hearing-aids to reach a balance between audibility and comfort for each patient. The balancing begins with prescribed gains across frequency based on each patient's pure-tone thresholds. These prescribed gains, based on average data, are then personalised through adjustments made by the clinician using patient feedback (Anderson, Arehart, and Souza 2018; Jenstad, Van Tasell, and Ewert 2003; Kuk and Ludvigsen 1999; Thielemans et al. 2017). The patient's feedback is often based solely on the effect the adjustments have on the perception of the clinician's voice, the most readily available stimulus in any clinic.
We have previously investigated what gain adjustments are discriminable for short sentences presented in quiet. Median just-noticeable differences (JNDs) for increases in gain (increments) in broad low-, mid- and high-frequency bands were 4, 4, and 7 dB, respectively (Caswell-Midwinter and Whitmer 2019). Gain adjustments less than these JNDs will, on average, not be readily perceived. A clinician may still receive feedback from a patient, but such feedback may be based not on the auditory perception of these adjustments but on other factors (cf. placebo effects without adjustment; Bentler et al. 2003; Dawes, Hopkins, and Munro 2013; Naylor et al. 2015). Using the same speech corpus, we have subsequently investigated what gain adjustments are necessary to elicit consistent preferences (Caswell-Midwinter and Whitmer 2021). Median preference thresholds, the minimum adjustment to elicit a preference, ranged from 4 to 12 dB for gain decrements and from 5 to 9 dB for increments in the same broad low-, mid-, and high-frequency bands. In Caswell-Midwinter and Whitmer (2019), it was posited that the greater JNDs for speech in quiet than for speech-shaped noise were due to the spectrotemporal sparsity of the speech. That is, for a given gain adjustment in any given band, the clean speech signal provided a smaller number of glimpses of the adjustment than same-spectrum noise. In Caswell-Midwinter and Whitmer (2021), it was further speculated that the large preference thresholds were due in part to the short duration of the stimuli. The current study tested this by measuring preference thresholds for gain adjustments across various stimulus durations. Although patients typically make quick comparisons on adjustments in the clinic, audiologists may talk for longer, which might elicit more frequent and reliable preferences.
Previous psychophysical research provides some evidence that speaking longer would lead to more consistent preferences: level discrimination improves with increasing duration, albeit mostly limited to short pure-tone stimuli. Increasing the duration of a 0.25, 1, or 8-kHz tone up to 0.5, 1, or 2 s, respectively, can improve level discrimination for normal-hearing listeners (Florentine 1986). Further, duration can improve pure-tone level discrimination in fixed and roving pedestal level but not across-frequency conditions (Oxenham and Buus 2000). For the discrimination of a tone's relative level within a complex (i.e. profile analysis), performance improves up to a duration of at least 100 ms (Dai and Green 1993). The ability to discriminate a gain adjustment in particular band(s) of speech bears partial resemblance to increment detection, the detection of an increase or "bump" in the level of an ongoing sound. Valente, Patra, and Jesteadt (2011) showed that increasing the duration of an ongoing 0.5 or 4.0-kHz tone increased the detectability of a time-centred bump in the tone's level more so than increasing the duration of the bump. There is some evidence of a duration effect with broadband stimuli: studying the detection of an 8-dB peak at 3.5 kHz in a broadband noise, Farrar et al. (1987) found that thresholds decreased as duration increased up to 300 ms, the maximum duration tested. Isarangura et al. (2019) found that the detection of spectral modulation in a broadband noise carrier also improved with increasing duration but reached asymptote by 200 ms. For speech stimuli, measures of duration effects on level discrimination are scant; in a study of overall level discrimination of speech, the threshold for words (mean duration 450 ms) was only significantly worse (greater) than for sentences (mean duration 1533 ms) when participants were aided (Whitmer and Akeroyd 2011).
In sound-quality evaluations such as comparing hearing-aid settings, a balance must be struck in sound-sample duration. The sample must be long enough to allow perception of the acoustic changes, but short enough to allow comparison of the adjusted sound with the previous (reference) sound. The International Telecommunication Union (ITU) recommendations for subjective sound-quality evaluations note that, for paired comparisons, durations should not exceed 15-20 s due to "short-term human memory limitations", but can be "a few seconds" (International Telecommunication Union, Radiocommunication Sector 2019, p. 6; cf. Cowan 1984). These memory limitations (the ability to maintain features of the first sound for comparison to the second) are often measured by assessing the effect of the inter-stimulus interval (ISI) behaviourally (Pollack 1972; Winkler and Cowan 2005) or physiologically (Bartha-Doering et al. 2015). In the clinic, the adjustment is often done without any gaps other than the natural pauses in ongoing speech. The memory limitation for comparing ongoing stimuli has previously been modelled as an exponential decay over many seconds, albeit for pure-tone stimuli (Durlach and Braida 1969; Massaro 1970). Despite qualitative recommendations and a long history of auditory memory research (cf. Cowan 1984), the effect of duration on preferences for speech stimuli, as assessed in the clinic during hearing-aid adjustments, is not known.
On the basis of the foregoing evidence, we hypothesised that increasing the duration of the stimuli would elicit more consistent and reliable preferences for gain adjustments. The current study used most of the same methods, including most of the same participants, as Caswell-Midwinter and Whitmer (2021) did when measuring preferences for gain adjustments. The main difference is the primary experimental contrast: stimulus duration. To avoid potential memory confounds, the maximum stimulus duration was 6 s (cf. International Telecommunication Union, Telecommunication Standardization Sector 2003); the minimum was 2 s (vs. 0.855-2.3 s in the previous study). To better mimic elements of a clinical session, there were five other methodological differences. First, the stimuli were consecutive segments from a continuous story instead of repeated (within a trial) sentences. Second, the gain adjustment was always made for the second stimulus on each trial, rather than randomised. Third, the number of gain steps was reduced from six (±4, 8, and 12 dB) to four (±6 and 12 dB). Fourth, there was no ISI. Finally, given the lack of agreement or reliability in using descriptors (e.g. "tinny") to describe the effect of a gain adjustment reported by Caswell-Midwinter and Whitmer (2021), the current study only measured preferences.

Participants
Twenty-nine adults (14 female) were recruited from a sample who had previously participated in a gain-discrimination experiment (Caswell-Midwinter and Whitmer 2019). The median age was 68 years (range 51-74 years). The median better-ear four-frequency (0.5, 1, 2, and 4 kHz) pure-tone threshold average (BE4FA) was 35 dB HL (range 12-56 dB HL; see left panel of Figure 1). None of the participants had a conductive loss (i.e. all participants' average air-bone threshold differences were less than 20 dB; British Academy of Audiology 2016).
Figure 1. The left panel shows median pure-tone thresholds as a function of frequency (circles, solid line) and interquartile ranges (error bars), with the individual thresholds for the three lowest and highest average thresholds (dotted lines). The right panel shows median sensation levels (approximated from pure-tone thresholds and applied gain) as a function of frequency (circles, solid line) and interquartile ranges (error bars), with the individual values for the three lowest and highest average sensation levels (dotted lines).
For the 19 participants who habitually wore hearing-aids at the time of the study, the real-ear insertion gain provided by their hearing-aids in their better ear was measured with 65 dB broadband noise input (ICRA URGN-M-N; Dreschler et al. 2001) and used as their gain prescription. For the ten participants who were not currently wearing hearing-aids, linear NAL-R gain prescriptions (Byrne and Dillon 1986) for their better ear were used. Sensation level (SL) of the stimuli was approximated from pure-tone thresholds and applied gain; the median sensation level for amplified stimuli, averaged across 0.5, 1, 2, and 4 kHz, was 35 dB SL (range 15-51 dB SL; see right panel of Figure 1). All participants had previously been fit with hearing-aids; the median hearing-aid experience was 10 years (range 2-35 years). Twenty-six of the 29 participants took part 18 months earlier in the preference experiment with short sentences (Caswell-Midwinter and Whitmer 2021).
All participants had also performed visual letter and digit monitoring tasks during a previous study (at least 18 months prior to the current study) to provide an estimate of their cognitive abilities (specifically working memory; Gatehouse, Naylor, and Elberling 2006). The tasks involved identifying triplet digit and letter sequences at two different ISIs (1 and 2 s); a full description is in Caswell-Midwinter and Whitmer (2019). The resulting d′ measures were averaged across digit and letter tasks and ISIs to give a single cognitive score.

Stimuli
The stimuli were consecutive segments of a Sherlock Holmes story read by a professional male actor with a Southern English accent ("The Naval Treaty"; Doyle 2011). The original stimuli were converted from stereo to mono and resampled to 24 kHz from an original sample rate of 44.1 kHz. Any silent gaps greater than 250 ms were truncated to 250 ms. On each trial, two consecutive segments were presented to the participants' better ear, both with the same duration of either 2, 4 or 6 s. For each segment, 50-ms linear onset and offset ramps were applied. To better mimic adjustments in the clinic, the standard stimulus was always the first stimulus in the pair, and there was no ISI beyond the offset and onset gating.
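The onset and offset gating described above can be illustrated with a minimal sketch. It assumes a segment held as a plain list of samples at the 24-kHz rate given in the text; the function name `apply_linear_ramps` and the constant-amplitude test segment are hypothetical and are not taken from the study's own processing code.

```python
def apply_linear_ramps(samples, sample_rate_hz=24000, ramp_ms=50.0):
    """Return a copy of `samples` with linear onset and offset ramps applied."""
    n_ramp = int(round(sample_rate_hz * ramp_ms / 1000.0))
    if 2 * n_ramp > len(samples):
        raise ValueError("segment is shorter than the two ramps combined")
    shaped = list(samples)
    for i in range(n_ramp):
        weight = i / n_ramp      # rises linearly from 0 towards 1 across the ramp
        shaped[i] *= weight      # onset ramp
        shaped[-1 - i] *= weight # mirrored offset ramp
    return shaped


# Hypothetical 2-s segment of constant amplitude, for illustration only.
segment = [1.0] * (2 * 24000)
gated = apply_linear_ramps(segment)
```

With a 50-ms ramp at 24 kHz, each ramp spans 1200 samples, so two gated segments played back to back are separated only by the offset and onset ramps, as in the no-ISI presentation described above.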
For the standard stimulus, real-ear or prescribed gain was applied across six frequency bands: a low-pass band with an upper cut-off of 0.25 kHz, four octave bands centred at 0.5, 1, 2, and 4 kHz, and a high-pass band with a lower cut-off of 6 kHz. For the target stimulus, additional gain (ΔGain) of either −12, −6, 0, +6, or +12 dB was applied in one of three broad frequency bands: a low-frequency band combining 0.25 (low-pass) and 0.5 kHz (octave) bands (LF), a mid-frequency band combining 1 and 2 kHz octave bands (MF), and a high-frequency band combining the 4 kHz and 6 kHz (high-pass) bands (HF). Stimuli were generated by convolving each segment with a 140-tap finite impulse response filter optimised for NAL-R equalisation at 24-kHz sample rate by Kates and Arehart (2010). The overall long-term A-weighted presentation level was 60 dB SPL to approximate in-quiet conversational level (Olsen 1998). The presentation level was verified with an artificial ear and sound level meter (Brüel & Kjaer 4152 and 2260), prior to any prescription or gain adjustment. The audibility of the segments was confirmed with each participant after the first trial.
We additionally analysed the effect of the natural variation in power within bands across the consecutive segments of each trial (i.e. when ΔGain = 0). There were significant mean absolute level differences within bands between the two segments in any given trial as a function of both frequency band and segment duration [F(2,56) = 13.06 and 19.41, respectively]. The differences, however, were small; absolute differences in band-specific level increased from 0.2 dB for the LF band to 0.3 dB for MF and HF bands [t(28) = 4.76; p < 0.001], and absolute level differences decreased from 0.3 to 0.2 to 0.1 dB when the duration increased from 2 to 4 to 6 s, respectively [t(28) = −2.58 and −4.39; p = 0.015 and 0.0002, respectively].
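To give a sense of scale for the adjustment steps, the following sketch converts a gain in dB to the linear amplitude factor applied within a band, using the standard relation ratio = 10^(dB/20). The function name is hypothetical, and the sketch does not reproduce the study's FIR filter design.

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in dB to a linear amplitude (pressure) ratio."""
    return 10.0 ** (gain_db / 20.0)


# The five adjustment values used for the target stimulus.
for gain_db in (-12, -6, 0, 6, 12):
    print(f"{gain_db:+d} dB -> amplitude x {db_to_amplitude_ratio(gain_db):.2f}")
```

A 6-dB step roughly doubles or halves the amplitude within the adjusted band, and a 12-dB step roughly quadruples or quarters it.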

Procedure
Participants were seated in a sound-isolated booth (IAC Acoustics), and listened to the stimuli through circumaural headphones (AKG K702) without hearing-aids. The change in stimulus within each trial from first to second segment was indicated on a touch screen in front of the participant. Participants were asked on each trial to indicate "How did the second sound compare to the first sound?" by selecting either the "better", "worse" or "no difference" button on the touch screen.
There were three segment durations (2, 4 and 6 s) and 13 gain adjustments (±6 and ±12 dB adjustments in the LF, MF and HF bands plus a no-adjustment control), resulting in 39 stimulus conditions. Each stimulus condition was repeated ten times, resulting in 390 trials (3 × 13 × 10). The order of presentation was randomised for each participant. The trial run was broken into equal blocks of 130 trials with breaks between. Prior to testing, each participant completed 12 practice trials consisting of one trial each of 2-s and 6-s segments with ±12 dB gain adjustments in each of the three bands.
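The condition and trial counts above can be checked with a short enumeration. This is a sketch only: the names are hypothetical, and shuffling the full trial list before splitting it into blocks is an assumption, since the text does not say whether randomisation was applied across the whole run or within blocks.

```python
import itertools
import random

DURATIONS_S = (2, 4, 6)
BANDS = ("LF", "MF", "HF")
ADJUSTMENTS_DB = (-12, -6, 6, 12)
REPETITIONS = 10
BLOCK_SIZE = 130


def build_trial_list(seed=None):
    """Enumerate the stimulus conditions and return a shuffled trial list."""
    gain_conditions = [(band, gain) for band in BANDS for gain in ADJUSTMENTS_DB]
    gain_conditions.append((None, 0))  # single no-adjustment control
    conditions = list(itertools.product(DURATIONS_S, gain_conditions))
    trials = conditions * REPETITIONS
    random.Random(seed).shuffle(trials)  # assumed: one shuffle over the whole run
    return conditions, trials


conditions, trials = build_trial_list(seed=1)
blocks = [trials[i:i + BLOCK_SIZE] for i in range(0, len(trials), BLOCK_SIZE)]
```

The enumeration gives 13 gain conditions (4 adjustments in each of 3 bands plus the control), 39 stimulus conditions and 390 trials in three blocks of 130.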
Ethical approval for the study was given by the West of Scotland research ethics committee (18/WS/0007) and NHS Scotland R&D (GN18EN094). All participants provided written informed consent prior to testing.

Preferences
The proportions of "better" (B), "worse" (W) and "no difference" (ND) judgments were calculated for each gain adjustment in each frequency band (Figure 2). A repeated-measures analysis of variance (RMANOVA) was run on the entire dataset (5 gain adjustments × 3 frequency bands × 3 segment durations) using combined "better" and "worse" proportions [P(B or W)] as the dependent variable (Table 1). Amount of gain adjustment, frequency band and duration all showed significant main effects on better-and-worse preferences. Better and worse judgments increased with increasing duration, from 2 to 4 s [t(28) = 8.44; p < 0.001] and from 4 to 6 s [t(28) = 2.80; p = 0.0092]. The greatest rates of "better" and "worse" responses were for LF adjustments.
As the current methods shared many aspects, including participants, with Caswell-Midwinter and Whitmer (2021), the current study's preference data were compared to the preferences elicited for short sentences in that previous study (grey triangles and dotted lines in Figure 2). In the current study there were more "better" and fewer "worse" ratings for +12-dB adjustments in the MF band [t(59) = 3.11 and −3.10 for better and worse, respectively; Holm-Bonferroni corrected p′ = 0.0028 and 0.0030] and HF band [t(59) = 5.32 and −3.77, respectively; both p′ < 0.001]. There were also more "better" and fewer "worse" ratings for the LF band for +12-dB adjustments in the current study compared to the previous (compare grey with coloured triangles in the left panel of Figure 2), but these differences were not statistically significant [t(59) = 1.99 and −1.60; both p > 0.05].
Participants were less prone to choose "no difference" when there was no gain adjustment in the current study compared to the previous study. The proportion of no difference responses at ΔGain = 0 was 0.84 across segment durations compared to 0.94 previously for short sentences [t(56) = 3.31; p = 0.0017].

Preference thresholds
The minimum gain adjustment required to elicit either a "better" or "worse" preference (the preference threshold) was estimated by fitting a logistic function to each individual's P(B or W) as a function of ΔGain. Separate functions were fitted for negative and positive gain adjustments (i.e. decrements and increments) for each frequency band. The threshold was defined as P(B or W) = 0.55 [P(ND) = 0.45], which corresponds to d′ = 1 for an unbiased differencing observer in a same-different discrimination task (Macmillan and Creelman 2005). Shapiro-Wilk tests indicated departures from normality for three of the 18 conditions: 4-s and 6-s LF increment and 2-s MF decrement thresholds (W = 0.91, 0.87 and 0.88, respectively; p = 0.018, 0.0034 and 0.0064); nevertheless, Tukey boxplots (Tukey 1977) are used in Figure 3 to show the range of preference thresholds for each condition. All statistical probabilities reported for pairwise comparisons and correlations were corrected for multiple comparisons using the Holm-Bonferroni method (Holm 1979); corrected probabilities are indicated by p′.
An RMANOVA based on the preference thresholds showed main effects of frequency band, direction of gain adjustment and segment duration (Table 2). Preference thresholds decreased with increasing segment duration, increased with increasing centre frequency and were greater for decrements than increments. There was a significant interaction of frequency band and gain direction; decrement thresholds increased more than increment thresholds with increasing centre frequency. There was also a significant albeit modest (η² = 0.11) interaction between gain direction and duration; preference thresholds decreased with increasing duration more for increments than decrements. There was additionally a significant but modest three-way interaction in the RMANOVA: preference thresholds for the MF band decreased with increasing segment duration more for decrements than for increments.
Mean thresholds with 95% repeated-measures confidence intervals (Loftus and Masson 1994) are shown in Table 3. Thresholds significantly decreased with increasing duration for gain increments in the LF, MF and HF frequency bands, and for gain decrements in the LF and MF bands; the thresholds for decrements in the HF band (12.1 dB) did not significantly change across durations. The overall rate of change in preference threshold (i.e. the difference in mean thresholds not including HF decrements divided by the difference in duration) decreased with increasing duration from −0.8 dB/s at 4 s to −0.4 dB/s at 6 s. That is, preference thresholds decreased more between 2 and 4 s than between 4 and 6 s.
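The threshold estimate described above can be sketched as follows. Several choices here are assumptions, not the study's method: the text does not state the form of the logistic or the fitting procedure, so the sketch uses a two-parameter logistic (midpoint and slope) fitted by a coarse least-squares grid search, and the example proportions are hypothetical. Only the 0.55 criterion comes from the text.

```python
import math


def logistic(x, midpoint, slope):
    """Two-parameter logistic: proportion of better-or-worse responses at gain x."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def fit_logistic(gains_db, proportions):
    """Least-squares grid search over midpoint (0-24 dB) and slope (0.05-2 per dB)."""
    best = None
    for m in range(0, 241):
        midpoint = m * 0.1
        for s in range(1, 41):
            slope = s * 0.05
            sse = sum((logistic(x, midpoint, slope) - p) ** 2
                      for x, p in zip(gains_db, proportions))
            if best is None or sse < best[0]:
                best = (sse, midpoint, slope)
    return best[1], best[2]


def preference_threshold(midpoint, slope, criterion=0.55):
    """Gain (dB) at which the fitted function crosses the criterion proportion."""
    return midpoint + math.log(criterion / (1.0 - criterion)) / slope


# Hypothetical increment data: absolute adjustment (dB) and P(B or W).
gains = [0, 6, 12]
observed = [0.10, 0.60, 0.95]
midpoint, slope = fit_logistic(gains, observed)
threshold_db = preference_threshold(midpoint, slope)
```

For decrements, the same functions can be applied to the absolute value of the adjustment. A maximum-likelihood fit would be a natural alternative to the least-squares grid used in this sketch.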
The preference thresholds measured here for 2-s consecutive segments of a continuous story were similar to the thresholds for short sentences reported by Caswell-Midwinter and Whitmer (2021) with the exception of MF and HF decrements, for which the current thresholds were significantly greater (t = 2.75 and 2.49; p′ = 0.011 and 0.030, respectively). Thresholds for 2-s stimuli, averaged across frequency bands, were positively correlated with thresholds in the previous study for both increments and decrements (ρ = 0.55 and 0.72, respectively; both p′ < 0.001). Preference thresholds were not correlated with age, BE4FA, or hearing-aid experience (all p′ > 0.05). HF increment preference thresholds were positively correlated with HF pure-tone thresholds (ρ = 0.48; p′ = 0.049), and negatively correlated with HF sensation level (ρ = −0.50; p′ = 0.038) and cognitive score (r = −0.62; p′ = 0.0020). Individual 2-s preference thresholds were correlated with individual decreases in threshold with duration, characterised as the slope in dB/s (r = −0.57; p′ = 0.0035). Individual 2-s, 4-s or 6-s preference thresholds were not correlated with individual cognitive scores (r = −0.37, −0.13 and 0.03, respectively; all p′ > 0.05), but slopes (dB/s) were correlated with cognitive scores (r = 0.50; p = 0.0057). Controlling for the variance shared with 2-s thresholds, individual slopes were still correlated with cognitive scores (r = 0.38; p = 0.047). That is, thresholds decreased more with duration (i.e. greater negative slope) for those with lesser letter/digit-monitoring ability. Based on this correlation, the RMANOVA of preference thresholds was re-run with centred cognitive scores as a covariate. As expected, the covariate reduced the error term, increasing the F statistics and η² effect sizes, but did not change the pattern of results shown in Table 2.
Figure 2. Mean proportion of preferences as a function of gain adjustment for low-frequency (LF; 0.5 kHz), mid-frequency (MF; 1-2 kHz) and high-frequency (HF; ≥4 kHz) bands (left, middle and right panels, respectively) for 2-s, 4-s and 6-s durations (short-dashed, long-dashed and solid lines, respectively; red, green and blue online). Better, worse and no difference preferences are shown as upward triangles, downward triangles and circles, respectively. Grey dotted lines and symbols show results using short sentences from Caswell-Midwinter and Whitmer (2021).
Table note: Degrees of freedom (df) and probabilities (p) reflect Greenhouse and Geisser (1959) corrections for non-sphericity.
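The Holm-Bonferroni correction behind the p′ values reported in this section can be sketched as below. This is the standard step-down procedure expressed as adjusted p-values; the function name and the example probabilities are hypothetical, and the sketch does not reproduce which tests the authors grouped into each family of comparisons.

```python
def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so that adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical uncorrected probabilities for a family of three comparisons.
corrected = holm_bonferroni([0.01, 0.04, 0.03])
```

In this hypothetical family, the smallest probability is multiplied by three, the next by two, and the largest inherits the running maximum, giving adjusted values of approximately 0.03, 0.06 and 0.06 in the original order.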

Preference agreement and reliability
Fleiss' κ (Fleiss 1971) was used to measure inter-participant agreement, comparing participants' most frequent judgment (better, worse or no different) for each adjustment condition. To simplify the analysis, judgments were collapsed across adjustments for each direction and frequency band; judgments for the ΔGain = 0 condition were not included in the analysis. Fleiss' κ was 0.39 [0.36-0.42 95% confidence intervals (CI)], 0.50 (0.47-0.53) and 0.50 (0.47-0.53) for segments of 2-s, 4-s and 6-s duration, respectively, representing "fair" (2 s) and "moderate" (4 and 6 s) agreement. That is, agreement significantly increased from 2 to 4 s, but not from 4 to 6 s. A participant's judgments ("better", "worse" or "no difference") for a given gain adjustment in a given frequency band were considered reliable if seven or more of those judgments were identical, a reliability threshold based on binomial probability theory (Kuk and Lau 1995). Individual reliabilities were averaged across conditions; judgments for the ΔGain = 0 condition were not included. Because the proportions of reliable preferences in the current study were not normally distributed based on Shapiro-Wilk tests (W = 0.92, 0.90 and 0.92 for 2-s, 4-s and 6-s stimuli), non-parametric tests were used to compare reliability across conditions. Figure 4 shows individual proportions of adjustments with reliable preferences. Reliability increased significantly from a median value of 67% for short sentences and 2-s segments to 75% for 4-s and 6-s segments [χ² = 11.10; p = 0.011]. There was no significant difference in reliability between sentences and 2-s segments (z = 0.65; p = 0.51) nor between 4-s and 6-s segments (z = 0.72; p = 0.47). The percentage of participants with ≥90% reliable preferences, however, did increase from 14% at 4 s to 28% at 6 s. Individual reliabilities for short sentences and 2-s stimuli were not correlated, but reliabilities for 4-s and 6-s stimuli were (r = 0.61; p = 0.0004).
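The two measures in this section can be sketched in a few lines. The Fleiss' κ function follows the standard formula for a table of items by categories. The binomial check is an illustration only: treating chance as 1/3 for three response options and using a 0.05 criterion are assumptions made here, not details given in the text, which cites Kuk and Lau (1995) for the seven-of-ten rule.

```python
from math import comb


def binomial_tail(n, k_min, p):
    """Probability of at least k_min successes in n trials with success rate p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] is the number of raters placing item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_observed = sum(p_items) / n_items
    # Expected agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_categories = [sum(row[j] for row in counts) / (n_items * n_raters)
                    for j in range(n_categories)]
    p_expected = sum(p * p for p in p_categories)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Chance probability of 7 or more identical judgments out of 10 (assumed p = 1/3).
p_seven_or_more = binomial_tail(10, 7, 1 / 3)
p_six_or_more = binomial_tail(10, 6, 1 / 3)
```

Under these assumptions, seven identical judgments out of ten is the smallest count whose chance probability falls below 0.05 (about 0.02, against about 0.08 for six), which is consistent with the criterion used above.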

Discussion
By having participants compare and judge consecutive segments of a single-narrator story, we have shown that longer durations promote more frequent and reliable "better" or "worse" preference judgments for gain adjustments in broad frequency bands. That is, the gain adjustments required to elicit consistent preferences decreased with increasing stimulus duration. The proportions of better or worse preferences were greater, so preference thresholds were smaller, for increments than for decrements, in agreement with Caswell-Midwinter and Whitmer (2021) as well as previous psychophysical literature (Ellermeier 1996; Moore, Oldfield, and Dooley 1989; Moore et al. 1997). Better and worse preferences were less frequent with increasing centre frequency of the adjustment band, as previously shown for short sentences (Caswell-Midwinter and Whitmer 2021).
Despite differences in the method, the median preference thresholds in the current study for 2-s segments were similar to the thresholds for 1.6-s average duration sentences in our previous study (Caswell-Midwinter and Whitmer 2021), and individual preference thresholds were correlated with the previous thresholds. As with the previous study, the strongest preferences were for increased LF gain and against decreased LF gain, as found in self-fitting studies (Keidser and Convery 2018; Nelson et al. 2018; Vaisberg et al. 2021). The long-term spectrum of the stimuli had its greatest power in the LF band; this may have influenced the discriminability of LF adjustments (Jesteadt et al. 2017), increasing preferences and reliability. There were preference differences between the two studies, with increases in "better" vs. "worse" judgments for MF and HF increments in the current study. The long-term spectrum in the HF band for the current monologue segments was 5.6 dB less than for the previous sentence stimuli. Increases in HF gain may have then been judged more favourably in the current study because of the greater audibility in that band. There were, though, no spectral differences to explain the MF increment preference discrepancy; further work is needed to better understand to what extent particular stimulus attributes (e.g. vocal timbre) and context (e.g. monologue vs. unconnected sentences) affect gain preferences.
Participants were less likely to respond "no difference" in the current study when consecutive segments were presented without gain adjustments compared to the previous study (Caswell-Midwinter and Whitmer 2021) where the same sentence was presented twice on each trial. This difference can be attributed to the comparison of two different speech segments; the naturally occurring differences in the spectrotemporal patterns between the two segments (without gain adjustments) could decrease the likelihood of a "no difference" response (Kidd, Mason, and Green 1986). The effect of this decrease in no-difference responses on threshold estimation was minimal; fitting logistic functions to the current data using the no-difference responses from the previous study increased threshold estimates by only 0.4 dB on average. Nevertheless, the change demonstrates a limitation of using sequential stimuli for comparison.
Table 2. Results of a repeated-measures analysis of variance on preference thresholds (see Table 1 for description of terms).
The use of an ongoing story, as opposed to hearing the same utterance twice, anecdotally provided a greater degree of participant engagement with the material, engagement as might occur in the clinic, where the responses of the patient will affect real-world use. Any greater engagement with the stimulus content, however, may have been detrimental to performing the task. Beyond the decrease in no-difference responses, the effect of comparing different stimuli (two consecutive segments) versus comparing identical stimuli was small. Using non-repeating segments introduces variability in the level and spectrum in the comparison, which can decrease detectability (Kidd, Mason, and Green 1986), thus increasing preference thresholds. In the present experiment, the use of the same talker throughout would have reduced signal uncertainty and thus reduced any effect of non-repeating segments on thresholds. To check the potential influence of extreme spectral variations between segment pairs, preference thresholds were recalculated excluding the 10% of trials with the greatest absolute difference in any band for each participant. The only significant effects of this recalculation were modest increases in the preference thresholds for 6-s MF and 2-s HF increment stimuli (Δthreshold = 0.2 and 0.3 dB; z = 2.72 and 2.13; p = 0.0065 and 0.032, respectively); all other threshold differences were not significantly different from zero (z = 0.14-1.22; all p > 0.05). Further, excluding trials based on extreme variation between their consecutive segments did not have any effect on the rate of change of preference thresholds as a function of duration. Thus, there is scant evidence that the natural variation in the consecutive stimuli affected the pattern of results.
The delivery of stimuli used for appraisal by the patient in the clinic may be different to paired or sequential comparisons. Rather, the appraisal may take the form of a single interval. Single interval ratings of hearing-aid sound quality have shown moderate test-retest reliability (Narendran and Humes 2003) and good inter-rater reliability (Gabrielsson et al. 1990), but these studies used stimulus durations of 50-60 s. Using such long stimuli for clinical fine-tuning may not be feasible. It is not known if durations > 6 s would provide even greater discriminability and more reliable preferences. While the thresholds across most conditions decreased significantly from 4 s to 6 s, the effect was small. The overall rate of change decreased from −0.8 dB/s between 2 and 4 s to −0.4 dB/s between 4 and 6 s, resembling the exponential decay in memory-based models of the effects of duration on pairwise comparison (e.g. Durlach and Braida 1969). There was a correlation between participants' monitoring-task cognitive scores and the rate of decrease in their preference thresholds with increasing duration. That is, the worse their cognitive scores, the stronger the effect of stimulus duration on preference thresholds. This suggests that there is a limit to the effect of duration in the judgment of gain adjustments, and further suggests that the greatest effect is for those with lesser cognitive capacity. The mean preferences were very similar for 4-s and 6-s stimuli (Figure 2), and there was no increase from 4 to 6 s in inter-participant agreement or intra-participant reliability (Figure 4). It is therefore unlikely for thresholds to decrease, or reliability to increase, much further beyond the results here for 6-s stimuli (cf. Bartha-Doering et al. 2015).
It is also not known how fast-acting compression, as delivered by many current hearing-aids, would affect results. The short-term variation in speech would interact with the compressor, potentially generating different preferences. The dynamic compression of speech, however, has previously not been found to have an effect on overall level discrimination of words and sentences (Whitmer and Akeroyd 2011), hence would not be expected to lead to more consistent preferences with duration.
The improvement in thresholds and reliability with increasing stimulus duration was small relative to the thresholds and reliabilities themselves. Talking or presenting stimuli for 6 s to a hearing-aid wearer in the clinic would help elicit consistent preferences for adjustments, but those adjustments would still need to be large: 3-6 dB for increments, 5-12 dB for decrements. These thresholds are well above common troubleshooting adjustments, especially for adjustments at higher frequencies. A patient may indeed state an immediate preference when a smaller adjustment has been made, but such a preference should be treated with caution, as it may not be based on the acoustical percept of the adjustment, and is therefore likely to be unreliable. For the personalisation of hearing-aids in the clinic, it is therefore important not only to say more than a few words (e.g. "how's that sound?") immediately following an adjustment, but also to ensure that the adjustment is large enough to elicit a consistent effect. Given these constraints, alternative methods of fitting, such as self-adjustments (Mackersie et al. 2019; Nelson et al. 2018), which have resulted in similar gains to those prescribed and fit by a clinician (cf. Sabin et al. 2020), may be more viable for effective hearing-aid personalisation, although further study is warranted.
Table 3. Mean preference thresholds (dB) with 95% confidence intervals in brackets for all conditions ("−" = decrements; "+" = increments) including mean data from Caswell-Midwinter and Whitmer (2021).