“Acoustic-driven oscillators as cortical pacemaker”: a commentary on Meyer, Sun & Martin (2019)

This is a commentary on a review article by Meyer, Sun & Martin (2019), “Synchronous, but not entrained: exogenous and endogenous cortical rhythms of speech and language processing”, doi:10.1080/23273798.2019.1693050. At the heart of this review article is the language comprehension process. Anchored in a psycho- and neurolinguistic viewpoint, the article argues for the centrality of endogenous cortical rhythms, not only as the facilitators of processes that generate abstract representations and predictions of language but also of processes that establish intrinsic synchronicity with the acoustics, with the priority to override processes realized by acoustic-driven, exogenous cortical rhythms. In this commentary I propose that the scaffold for the speech decoding process – through parsing – is an acoustic determinant. Whether oscillation-driven or not, the decoding process is paced by a hierarchical cortical clock, realized by oscillators locked to the input rhythm on multiple Newtonian-time scales, keeping the decoding process in sync with the linguistic information flow. Only if such a lockstep is secured can reliable decoding proceed.

ARTICLE HISTORY: Received 21 January 2020; Accepted 19 February 2020


Prelude
The review article "Synchronous, but not entrained: exogenous and endogenous cortical rhythms of speech and language processing" (Meyer et al., 2019) examines the possible role of cortical rhythms in the language comprehension process, end-to-end. This process encompasses two distinct processes: (i) a speech process, which maps the acoustics into abstract representations of linguistic units, and (ii) a language process, which uses these units to derive language features, including syntax and sentence-level semantics. The authors argue for the centrality of endogenous cortical oscillators, not only at the core of the language process but also with the priority to override processes with acoustic-driven cortical oscillators at their core. My commentary concludes that, for reliable language comprehension, both the speech process and the language process must operate within cortical time units (CTUs) determined by the acoustics. How did I arrive at this conclusion?

Role of oscillators: current view
Speech (everyday speech, in particular) is inherently a quasi-rhythmic phenomenon in which the talker's linguistic information is transmitted in "packets", manifested in the acoustic signal in the form of temporal "chunks". Oscillation-based models of the speech process postulate a cortical computation principle by which the decoding process is performed on acoustic chunks defined by a time-varying window structure synchronised with the input on multiple time scales. In the following we shall exemplify this computation principle with TEMPO (Ghitza, 2011), a model which epitomises recently proposed oscillation-based models of speech perception (e.g. Ahissar & Ahissar, 2005; Ding & Simon, 2009; Ghitza & Greenberg, 2009; Giraud & Poeppel, 2012; Gross et al., 2013; Peelle & Davis, 2012; Poeppel, 2003).
The model is shown in Figure 1. The sensory stream (generated by a model of the auditory periphery, e.g. Chi et al., 1999; Messing et al., 2009) is processed, simultaneously, by a segmentation path and a decoding path (upper and lower paths of Figure 1, respectively). Conventional models of speech perception assume a strict decoding of the acoustic signal (Note 1). The decoding path of TEMPO, which links acoustic chunks of different durations with stored linguistic memory patterns, conforms to this notion. Not present in conventional models is the segmentation path, which determines the acoustic chunks (their location and duration) to be decoded. As it turns out, segmentation plays a crucial role in explaining a range of counterintuitive psychophysical data that are hard to explain by the conventional models (e.g. Ghitza & Greenberg, 2009; Ghitza, 2012, 2014, 2017). In TEMPO, the segmentation path is realised by an array of flexible oscillators locked to the input rhythm. At the pre-lexical level of TEMPO, the segmentation process is realised by a flexible theta oscillator locked to the input syllabic rhythm, where the theta cycles constitute the syllabic windows. A theta cycle is set by an evolving phase-locking process (e.g. a PLL circuit, Ahissar et al., 1997; Viterbi, 1966), during which the code is generated. Doelling et al. (2014) provided magnetoencephalography (MEG) evidence for the role of theta, showing that intelligibility is correlated with the existence of acoustic-driven theta neuronal oscillations.
At the phrase level, the segmentation process is realised by a flexible delta oscillator locked to the input phrase-chunk rhythm, where the delta cycles constitute the phrase-chunk windows. A delta cycle is set by an evolving phase-locking process, during which contextual parsing proceeds. Rimmele et al. (2020) provided MEG evidence for the role of acoustic-driven delta, showing that the accuracy of digit retrieval is correlated with the existence of acoustic-driven delta neuronal oscillations.
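The locking behaviour described above can be illustrated with a minimal sketch. It is not the PLL circuit cited in the text: it is a simplified period-tracking rule, assumed here for illustration only, in which the oscillator nudges its cycle duration toward each observed chunk-onset interval and is clamped to a fixed frequency range. The function name, the `gain` parameter and the 4 Hz lower bound are hypothetical; only the 8 Hz upper bound of theta is taken from this commentary.

```python
def track_cycle_durations(onsets_s, f_min_hz=4.0, f_max_hz=8.0, gain=0.5):
    """Return the cycle duration (s) the oscillator adopts after each onset interval.

    Simplified stand-in for phase locking: the period moves toward each observed
    inter-onset interval, then is clamped to the oscillator's allowed range.
    f_min_hz and gain are illustrative assumptions, not values from the text.
    """
    shortest, longest = 1.0 / f_max_hz, 1.0 / f_min_hz
    period = 2.0 / (f_min_hz + f_max_hz)  # start at the centre frequency
    history = []
    for prev, cur in zip(onsets_s, onsets_s[1:]):
        period += gain * ((cur - prev) - period)      # move toward the input interval
        period = min(max(period, shortest), longest)  # respect the frequency range
        history.append(period)
    return history


# Hypothetical chunk onsets: a 5 Hz rhythm, and the same rhythm time-compressed by 3.
normal_onsets = [k * 0.2 for k in range(12)]
fast_onsets = [k * 0.2 / 3 for k in range(12)]
normal_cycles = track_cycle_durations(normal_onsets)
fast_cycles = track_cycle_durations(fast_onsets)
```

Under these assumptions, the cycle duration settles near 0.2 s for the 5 Hz input, whereas for the time-compressed input it stays pinned at 0.125 s, the shortest cycle the assumed range allows, so the windows no longer follow the input chunks.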

Role of oscillators: a broader look
As seen in Section 2, the functional role of the acoustic-driven theta and delta oscillators is to facilitate a time-varying window structure, synchronised with the input, where the theta/delta cycles determine the syllable/phrase chunks to be decoded. In this Section, a broader functional role for the acoustic-driven theta and delta is postulated, namely, they constitute an internal, hierarchical clock that paces the speech decoding process to stay in sync with the linguistic information flow, by keeping the decoding process operating on acoustic chunks aligned with proper linguistic units. In the following, I shall outline the rationale for this postulate.
Figure 1 (caption, beginning not recovered). […] (Viterbi, 1966; Ahissar et al., 1997; see also the biophysical computational model by Pittman-Polletta et al., 2020), with quasi-periodic oscillations that are locked to the quasi-rhythmic acoustic syllable- and phrase-chunks. (ii) The decoding path. Decoding is steered by segmentation: the decoding process evolves within the theta/delta cycles. See Figures 2 and 3 for the sequence of operations on the syllable and phrase levels, respectively.
In Figure 2, the sequence of operations executed in mapping the acoustic stream onto a series of syllable objects is outlined in more detail. First is the segmentation process, in the form of an acoustic-driven theta cycle set by an evolving phase-locking process (Note 2; step 1 in Figure 2). While the theta cycle is evolving, a neural code for the syllable chunk is generated throughout the theta cycle, e.g. in the form of gamma nested in theta (Note 3; step 2). The code is transmitted at the end of the theta cycle (step 3), then recognised (i.e. a working memory storage is activated) during the next theta cycle (step 4). An additional functional role of theta, beyond the setting of the theta window, emerges: the end-time of the theta cycle marks the moment at which the code is transmitted, i.e. it marks the moment by which the code generation must end. This is a necessary condition because, beyond this moment, the code-generation circuitry should already be occupied with the generation of the code for the next theta chunk.
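The timing constraint in this sequence can be made concrete with a small scheduling sketch. It only lays out when each step occurs relative to the theta-cycle boundaries; it does not model the neural code itself. The function name, the dictionary keys and the boundary times are hypothetical.

```python
def syllable_schedule(cycle_bounds_s):
    """Lay out steps 2-4 for each theta cycle.

    cycle_bounds_s: theta-cycle boundary times [t0, t1, ..., tn] in seconds
    (assumed to be already set by the segmentation process, step 1).
    """
    events = []
    last = len(cycle_bounds_s) - 1
    for k in range(last):
        start, end = cycle_bounds_s[k], cycle_bounds_s[k + 1]
        next_end = cycle_bounds_s[k + 2] if k + 2 <= last else None
        events.append({
            "chunk": k,
            "code_generated": (start, end),         # step 2: throughout the cycle
            "transmitted_at": end,                  # step 3: at the end of the cycle
            "recognised_during": (end, next_end),   # step 4: during the next cycle
        })
    return events


# Hypothetical boundaries of three consecutive theta cycles.
schedule = syllable_schedule([0.0, 0.2, 0.35, 0.6])
```

In this layout the recognition of chunk k occupies the same interval as the code generation for chunk k+1, which is why the code for chunk k has to be complete by the end of its own cycle.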
Turning to the phrase level, the sequence of operations that take place in mapping the stream of syllable objects onto phrase constituent candidates (PCCs) is shown in Figure 3. Segmentation comes first, in the form of an acoustic-driven delta cycle set by an evolving phase-locking process (Note 4; step 1 in Figure 3). While the delta cycle is evolving, PCCs are obtained by a parsing process that takes place throughout the delta cycle (step 2). The PCCs are multiplexed at the end of the delta cycle (step 3). An additional functional role of delta emerges, analogous to that of theta: the end-time of the delta cycle marks the moment by which the PCCs must be delivered.
Three points merit discussion. First, while we find cortical oscillations with cycle durations that correspond to syllables and phrases (theta and delta), we do not have oscillations that correspond to words. Indeed, there is no compelling linguistic evidence that words are regular enough for phase locking. Therefore, in TEMPO, the lexical access process operates on the syllable stream without any segmentation-based supervision (see, for example, the model TRACE, Luce & McLennan, 2005). Second, in generating the PCCs, numerous computation strategies can be considered (e.g. template matching; statistical pattern recognition; predictive coding; Bayesian inference; analysis-by-synthesis; relations via correlations; statistical learning). The parsing process operates on words throughout the delta window and is not necessarily sequential, nor is it necessarily oscillation-based (Note 5). Important to our discussion, regardless of the computation strategy, in order to stay in lockstep with the input information flow the derivation of the PCCs must be concluded by the end of the delta window. And third, in deriving language features, including syntax and sentence-level semantics, the language process operates on a sequence of PCCs that span a few delta cycles.
A few questions, beyond the scope of this commentary, remain open, e.g.: how is the duration of the language-process "window" determined? Is it formed by a segmentation process realised by ultra-slow oscillators locked to the sentence-level information flow?

Cortical time units
The sequence of operations described in Section 3, at the syllable level and at the phrase level, is repetitive, irrespective of the theta/delta window durations. Functionally, therefore, the speech decoding process can be viewed as a process paced by an internal clock with uniform cortical time units (CTUs): (i) a theta CTU, with duration, in Newtonian time (Note 6), of one theta cycle, and (ii) a delta CTU, with duration of one delta cycle. The CTUs are set by oscillators that are in sync with the input. As such, the CTUs, uniform in the internal domain, span nonuniform durations in Newtonian time (Figure 4, left). Crucially, the CTUs have a limited range, bounded in Newtonian time by the upper frequency range of the oscillators. Hence, the shortest duration of a theta CTU is about 125 ms (for theta_max = 8 Hz), and the shortest duration of a delta CTU is about 0.5 s (for delta_max = 2 Hz). Speech decoding, therefore, is viewed as a process that proceeds in uniform cortical-time ticks: at the syllable level, the entire sequence of operations in Figure 2 is executed in one theta CTU; at the phrase level, the entire sequence of operations in Figure 3 is executed in one delta CTU.
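The relation between the two time axes can be sketched in a few lines. The first function reproduces the arithmetic behind the shortest CTU durations (one cycle at the oscillator's maximal frequency); the second maps a moment in Newtonian time onto a cortical tick index, given cycle end-times. The function names and the example cycle end-times are hypothetical.

```python
from bisect import bisect_right


def shortest_ctu_s(f_max_hz):
    """Shortest CTU in Newtonian time: one cycle at the maximal frequency."""
    return 1.0 / f_max_hz


def cortical_tick(t_s, cycle_ends_s):
    """Index of the CTU in which Newtonian time t_s falls.

    cycle_ends_s: sorted end-times (s) of consecutive cycles; a moment that
    coincides with a cycle end is assigned to the following CTU.
    """
    return bisect_right(cycle_ends_s, t_s)


# Hypothetical theta cycles of unequal Newtonian duration (0.20 s, 0.15 s, 0.25 s).
theta_cycle_ends = [0.20, 0.35, 0.60]
ticks = [cortical_tick(t, theta_cycle_ends) for t in (0.10, 0.30, 0.50)]
```

Here the three cycles span different durations in Newtonian time, yet each counts as exactly one tick in cortical time.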
In Sections 4.1 and 4.2 we shall examine, through the internal clock prism, the resulting output of TEMPO when the input is speech at normal rate, and when it is accelerated. Recall that the intelligibility of time-compressed speech is flawless when the speech rate is inside the theta range, and deteriorates sharply when the rate is outside theta (e.g. Foulke & Sticht, 1969; Garvey, 1953; Ghitza, 2014). As we shall see, as long as the input is at normal rate, the CTUs are aligned with acoustic chunks associated with syllables and phrases in their primitive sense, hence the internal clock and the linguistic information flow are in lockstep. When the input speech rate is too fast, the CTUs are no longer aligned with proper linguistic units, hence synchronisation is lost.

Input rate inside theta range (Figure 4, left)
In cortical time, syllabification and parsing proceed uniformly: a syllable object is generated and transmitted every theta CTU tick, and the PCCs are generated and multiplexed every delta CTU tick. Importantly, the derivation of the PCCs is concluded within one delta CTU, regardless of computation strategy.

Input rate too fast (Figure 4, right)
Two scenarios are considered: (i) the syllable-chunk rate is outside the theta range, but the phrase-chunk rate is inside the delta range, and (ii) the syllable-chunk rate is inside but the phrase-chunk rate is outside. In both scenarios there is a mismatch between the linguistic information flow and the internal clock, resulting in a deterioration in performance.
Figure 4. Newtonian time and cortical time, illustrated at the syllable level for normal rate (left) and fast speech (right). In both speeds, decoding proceeds uniformly in cortical time and syllable objects are transmitted one per theta CTU tick. In normal rate (left), the theta tracking is successful ⇒ a syllable chunk associated with a theta CTU is aligned with a syllabic unit. However, when the input rate is too fast (right, speech is time-compressed by 3) theta is "stuck" at upper frequency range ⇒ loss of tracking ⇒ acoustic chunks associated with the theta CTUs are no longer aligned with syllabic units.
In scenario (i), viewed in Newtonian time, since the syllable-chunk rate is outside the theta range, the synchronisation between the acoustic stream and the theta oscillator is disrupted because the oscillator reaches its upper boundary. The oscillator is stuck at frequency theta_max ⇒ erroneous segmentation, in both the location and the duration of the theta window ⇒ the acoustic chunk is no longer aligned with a syllabic unit ⇒ the stream of syllable objects is corrupted. Consequently, the resulting PCCs are in error. Viewed in cortical time, syllable objects are transmitted one per theta CTU tick (each spanning a duration of one theta_max cycle in Newtonian time) but with error. The error in the syllable-objects stream affects parsing: the PCCs are emitted one per delta CTU tick, in sync with the phrase-chunk rate but with compromised accuracy due to the erroneous syllable-objects stream.
In scenario (ii), since the syllabic rate is inside the theta range, synchronisation at the syllable level is maintained and syllable objects are correctly recognised, one per theta CTU. However, synchronisation between the acoustic stream and the delta oscillator is disrupted because the oscillator is stuck at frequency delta_max, resulting in erroneous segmentation in both the location and the duration of the delta window. Consequently, the PCCs, emitted one per delta CTU tick (each spanning a duration of one delta_max cycle in Newtonian time), are in error.
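The two scenarios can be summarised as a simple decision rule on the input rates. The sketch below uses the upper bounds stated earlier (8 Hz for theta, 2 Hz for delta) and, as a simplifying assumption, ignores the lower bounds of the ranges. The function name and the returned labels are hypothetical, and the case in which both rates are too fast is not discussed in the text; it is included only so that the function is total.

```python
THETA_MAX_HZ = 8.0  # upper bound of the theta range, as stated in the text
DELTA_MAX_HZ = 2.0  # upper bound of the delta range, as stated in the text


def classify_input(syllable_rate_hz, phrase_rate_hz):
    """Label the input by which oscillator, if any, can no longer track it.

    Only the upper bounds are checked (assumption: the input is never too slow).
    """
    theta_tracks = syllable_rate_hz <= THETA_MAX_HZ
    delta_tracks = phrase_rate_hz <= DELTA_MAX_HZ
    if theta_tracks and delta_tracks:
        return "lockstep"
    if not theta_tracks and delta_tracks:
        return "scenario (i): syllable objects corrupted, PCC accuracy compromised"
    if theta_tracks and not delta_tracks:
        return "scenario (ii): syllable objects correct, PCCs in error"
    return "both rates outside range (not covered in the text)"


# Hypothetical rates: a 5 Hz syllable rate time-compressed by 3 becomes 15 Hz.
label_fast_syllables = classify_input(15.0, 1.5)
label_fast_phrases = classify_input(6.0, 3.0)
```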

Partial restoration of intelligibility
As we see, for fast speech the deterioration in intelligibility is the result of a mismatch between the internal clock and the information stream, such that the acoustic chunks associated with CTUs are no longer aligned with proper linguistic units. To restore intelligibility, the speech acoustics should be modified so as to bring the input rate back inside the range of the internal clock. Two studies examined this approach: (i) at the syllable level, it has been shown that intelligibility is improved as a result of "repackaging", a process of dividing the time-compressed waveform into fragments, called packets, and delivering the packets at a prescribed rate determined by insertion of gaps in-between the packets (Ghitza & Greenberg, 2009; Ghitza, 2014; see Christiansen & Chater, 2016). The insertion of gaps is, in fact, a procedure of tuning the packaging rate in a search for a better synchronisation between the input information flow and the cortical clock, resulting in improvement in intelligibility. And (ii) at the phrase level, it has been shown that performance is impaired when the phrase-chunk presentation rate is outside the delta range, and that performance is restored by bringing the chunk rate back inside the delta range via inserting gaps in-between the chunks (Ghitza, 2017; Rimmele et al., 2020).
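The repackaging manipulation can be sketched as a simple operation on a sampled waveform: cut the time-compressed signal into fixed-length packets and append silence to each so that packets are delivered at a chosen rate. This is an illustration of the idea only, not the stimulus-generation procedure of the cited studies; the function name, the packet duration and the target rate used in the example are hypothetical.

```python
def repackage(samples, sample_rate_hz, packet_s, packet_rate_hz):
    """Insert silent gaps so that packets start at packet_rate_hz.

    samples: time-compressed waveform as a list of floats.
    packet_s: packet duration in seconds (illustrative assumption).
    packet_rate_hz: prescribed packet delivery rate.
    """
    packet_len = round(packet_s * sample_rate_hz)
    period_len = round(sample_rate_hz / packet_rate_hz)
    gap_len = period_len - packet_len
    if packet_len <= 0 or gap_len < 0:
        raise ValueError("packet must be non-empty and fit inside one period")
    out = []
    for start in range(0, len(samples), packet_len):
        out.extend(samples[start:start + packet_len])  # the packet itself
        out.extend([0.0] * gap_len)                    # the inserted gap (silence)
    return out


# Hypothetical example: 1.2 s of compressed signal at 1 kHz, 40 ms packets at 8 Hz.
compressed = [1.0] * 1200
repackaged = repackage(compressed, 1000, 0.04, 8.0)
```

With these illustrative values each 40 ms packet is followed by an 85 ms gap, so packets arrive every 125 ms. The rate of information delivery is thereby brought down to 8 packets per second, while the acoustics inside each packet remain time-compressed.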

Summary
We claim that, from a functional role perspective, speech decoding is a process paced by an internal, hierarchical clock with uniform CTUs: a theta CTU with duration of one theta cycle in Newtonian time, and a delta CTU with duration of one delta cycle. The CTUs are synchronised with the input. A necessary condition emerges according to which the sequence of operations to decode one syllable must be performed within one theta CTU and the sequence of operations to parse one phrase must be performed within one delta CTU. Importantly, these necessary conditions hold for any decoding computation strategy that may be in place, whether context-invoked or not, whether sequential or not, or whether oscillation-driven or not. Hence, the scaffold for the speech decoding process is an acoustic determinant, realised by acoustic-driven theta and delta oscillators.

Notes
1. In conventional models of speech perception phones are identified first, and the ordered sequence of identified phonemes results in a pointer to the word lexicon (e.g. Marslen-Wilson, 1987; Luce & McLennan, 2005; Stevens, 2005).
2. The acoustic cues to which the theta oscillator is locked are still under debate (acoustic edges? vocalic nuclei?). Here, the theta cycle is locked to vocalic nuclei, hence the syllable objects are in the form of VCVs (Ghitza, 2013).
3. A possible mechanism to generate the neural code is via gamma sampling (Shamir et al., 2009; Ghitza, 2011).
4. The delta oscillator is locked to accentuation attributes; the acoustic cues that form accentuation are still under pursuit.
5. The role of endogenous oscillations in generating abstract linguistic predictions (e.g. Meyer & Gumbert, 2018) is still under debate.
6. Newtonian time, in seconds. See Chapter "Newtonian and Bergsonian Time," in Wiener, 1948.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was funded by a research grant from the USA Air Force Office of Scientific Research (AFOSR) [grant number FA9550-18-1-0055].