Similarities in face and voice cerebral processing

ABSTRACT In this short paper I illustrate, through a few selected examples, several compelling similarities in the functional organization of face and voice cerebral processing: (1) Presence of cortical areas selective to face or voice stimuli, also observed in non-human primates, and causally related to perception; (2) Coding of face or voice identity using a “norm-based” scheme; (3) Personality inferences from faces and voices in the same Trustworthiness–Dominance “social space”.

Although the nature of the sensory input is highly different for facial or vocal information, growing evidence suggests that the cerebral architecture processing these two types of signals is organized following similar principles (Yovel & Belin, 2013). This short paper provides a biased, non-exhaustive comparison of the perceptual and neural mechanisms involved in face and voice processing, focusing on a few examples, mostly from my own work, that illustrate their puzzling similarities.
a. Human neuroimaging. Functional magnetic resonance imaging (fMRI) studies in humans have revealed Temporal Voice Areas (TVAs) in human auditory cortex (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000), analogous to the "face areas" or "face patches" of visual cortex (Freiwald, Tsao, & Livingstone, 2009; Haxby, Hoffman, & Gobbini, 2000; Kanwisher et al., 1997; Tsao et al., 2006). TVAs show greater response to voices, whether or not they contain speech, than to other categories of nonvocal sounds from the environment or to acoustical control stimuli such as scrambled voices and amplitude-modulated noise. They are organized bilaterally in several clusters along the superior temporal gyrus and superior temporal sulcus of the temporal lobe (Belin et al., 2000; Belin, Zatorre, & Ahad, 2002; Linden, Thornton, Kuswanto, Johnston, & Jackson, 2011; Von Kriegstein & Giraud, 2004). A recent large analysis of cerebral voice sensitivity in several hundred participants (Pernet et al., 2015) demonstrates that TVAs are the most salient part of a "vocal brain", a bilateral, distributed network of cortical and subcortical regions showing significant voice sensitivity, including in particular inferior prefrontal areas and the amygdala. A cluster analysis of the peaks of voice-sensitive response in these participants provided evidence of an organization in three "voice patches" along the antero-posterior axis of the superior temporal sulci and gyri bilaterally (Figure 1).
b. Causal link with perception. The increased response to faces seen in face-sensitive cortical areas is causally related to face processing. Transiently interfering with neuronal populations in the occipital face area (OFA) using transcranial magnetic stimulation (TMS) results in specific drops in behavioural performance on tasks involving the processing of faces, but not other objects (Pitcher, Charles, Devlin, Walsh, & Duchaine, 2009; Pitcher, Walsh, Yovel, & Duchaine, 2007). Likewise, TMS of the TVA interferes with voice perception (Bestelmeyer, Belin, & Grosbras, 2011). Repetitive TMS of the TVA peak in the right hemisphere induces a performance difference between a voice perception task (voice/nonvoice categorization) and a low-level nonvocal auditory task (loudness judgment) that is not observed when stimulating a control site in parietal cortex (Bestelmeyer et al., 2011). Although that study is but the starting point of a line of research potentially as productive as that using TMS to dissect face processing, it clearly establishes a causal link between TVA activation and voice perception, as found for OFA activation and face processing.
c. Human electrophysiology. Electrophysiological techniques such as electroencephalography (EEG) and magnetoencephalography (MEG) reveal comparable time courses for face and voice processing. For faces, both techniques reveal a well-known N170/N170m component, most prominent on occipitotemporal electrodes bilaterally, with generally higher amplitude in response to faces than to other objects (Bentin et al., 1996). Similarly, high-density EEG shows that bilateral fronto-temporal electrodes display a component in the P200 range, called the "fronto-temporal positivity to voices" (FTPV), with a larger amplitude in response to vocal than to nonvocal sounds as early as about 170 ms after sound onset (Charest et al., 2009). MEG confirms this finding and identifies sources of the FTPVm in anterior/posterior STG/STS bilaterally, overlapping with the fMRI-derived anatomical location of TVAs (Capilla, Belin, & Gross, 2013). Thus, it takes about two-tenths of a second for our brain to differentiate face or voice from other signals in the same sensory modality.
d. Non-human studies. Studies of face processing in macaques using fMRI have consistently identified a series of cortical "face patches" and, together with electrophysiology, described their functional properties in more detail than is feasible in humans. The results suggest a series of increasingly detailed face representations as one moves anteriorly towards the frontal lobe, with some patches containing a large proportion of "face cells", i.e., neurons displaying face preference at the individual level (Freiwald & Tsao, 2010; Freiwald et al., 2009; Tsao et al., 2006).
The current understanding of voice processing in the macaque brain is less advanced, and different studies yield partly conflicting results: some groups do not find any reliable cortical areas differentiating conspecific vocalizations from other complex sounds (Joly et al., 2012), while other groups show differences in variable cortical areas (Gil-da-Costa et al., 2004; Gil-da-Costa et al., 2006; Ortiz-Rios et al., 2015; Petkov et al., 2008). A pioneering study by Chris Petkov and colleagues provided clear evidence for TVAs in the macaque brain: using fMRI in awake macaques during auditory stimulation with macaque vocalizations and other sound categories, they observed, in two animals, several "voice patches" in the temporal lobe with greater activity in response to macaque vocalizations than to other complex sounds (Petkov et al., 2008), the macaque equivalent of the human voice areas. One of the voice patches, located in the right anterior temporal lobe, showed adaptation to speaker identity (Petkov et al., 2008), similar to evidence obtained in humans in an analogous anatomical location (Belin & Zatorre, 2003). Moreover, single-cell recordings performed in this fMRI-identified location provided the first evidence of "voice cells", i.e., individual neurons showing significant selectivity to conspecific vocalizations (Perrodin, Kayser, Logothetis, & Petkov, 2011). These results are important in that they strongly suggest that the last common ancestor of humans and macaques, some 20-25 million years ago, was already equipped with rudimentary voice-selective cortical mechanisms. Thus, when our hominin ancestors started speaking a few tens or hundreds of thousands of years ago, they were already equipped with neural mechanisms tuned over millions of years to analyse voice information.
Interestingly, fMRI studies in dogs have also recently provided evidence of voice areas in the dog brain, that is, areas responding significantly more to dog vocalizations than to other sounds (Andics, Gacsi, Farago, Kis, & Miklosi, 2014), which pushes back the emergence of the vocal brain to 80 million years ago. Future studies should confirm the existence of these voice patches, detail their anatomical location and inter-individual variability, and examine potential differences in underlying voice representations.

Identity processing
Faces and voices are the two most important signal categories allowing us to recognize other individuals. A large number of studies have investigated the functional and neuronal architecture underlying face recognition (Bruce & Young, 1986; Calder & Young, 2005; Tsao & Livingstone, 2008; Young & Bruce, 2011). Comparatively less effort has been devoted to studying speaker recognition (Blank, Wieland, & von Kriegstein, 2014; Perrodin, Kayser, Abel, Logothetis, & Petkov, 2015); what is known, however, reveals troubling similarities with face recognition.
a. Selective recognition deficits. Selective recognition deficits are known to occur for both face and voice identity processing. Since Bodamer (1947), many cases of "prosopagnosia" have been documented: patients who, following a brain lesion, become unable to recognize previously known faces while still being able to recognize nonface object categories (Rossion, 2014; Sergent & Signoret, 1992). Some persons even present face recognition impairments in the absence of any evident brain damage, a deficit termed "developmental" or "congenital" prosopagnosia (Duchaine & Nakayama, 2006). Importantly, other aspects of face perception, such as the ability to recognise emotions or lip-read, seem to be preserved in prosopagnosic patients, indicating that the functional pathway underlying face identity processing is partially dissociated from those underlying emotional or speech information processing (Bruce & Young, 1986; Young & Bruce, 2011). A directly comparable deficit is also known to occur for speaker recognition, although it has been documented in a much smaller number of cases (Van Lancker, Cummings, Kreiman, & Dobkin, 1988; Van Lancker, Kreiman, & Cummings, 1989). This deficit in speaker recognition, called "phonagnosia", also occurs more frequently after right hemisphere lesions. As for prosopagnosia, the deficit seems quite selective to identity information processing, as these patients typically show normal speech comprehension or emotion recognition from voice. A small number of cases of so-called "developmental phonagnosia", presenting the deficit quite selectively without evident brain damage, have also recently been described (Garrido et al., 2009; Roswandowitz et al., 2014; Xu et al., 2015). The neural correlates of such deficits are the subject of ongoing investigation by several groups.
b. Perceptual coding. How is voice or face identity coded in the brain? In both cases there appear to be distinct mechanisms for the coding of identity in familiar and unfamiliar faces/voices. In the case of faces this is shown, for instance, by the good performance of typical observers in matching different views of a familiar face in tests such as the Benton test, whereas different views of the same unfamiliar face are often perceived as corresponding to different identities. For voices a comparable dissociation has been observed between discrimination of unfamiliar speakers and recognition of known speakers (Van Lancker et al., 1988).
The coding of unfamiliar identities appears to be performed for both faces and voices using a norm-based coding mechanism. For faces, at the behavioural level, adaptation after-effects cause larger identity categorization performance differences for face adapters that sit opposite the target face relative to the prototypical face, the so-called "anti-faces", which has been interpreted as highlighting the special role of the prototype in identity coding (Leopold, O'Toole, Vetter, & Blanz, 2001; Rhodes & Jeffery, 2006). At the neuronal level, single-cell recordings in macaque inferior temporal lobe (Leopold, Bondar, & Giese, 2006) as well as human fMRI measures of FFA activity (Loffler, Yourganov, Wilkinson, & Wilson, 2005) indicate that faces more dissimilar to an identity-free face prototype (approximated by computer averaging of many different faces) elicit greater neuronal activity than less distinctive faces that are more similar to the prototype. For voices, remarkably similar evidence has recently been obtained in my group. As for faces, behavioural adaptation after-effects induced by "anti-voice" adapters induce greater perceptual shifts in speaker identification than non-opposite adapters (Latinus & Belin, 2011). Moreover, voices that are more acoustically different from an internal voice prototype (approximated by the morphing-generated average of many voices) are perceived as more distinctive, and elicit greater activity in the TVAs, than voices with a shorter distance to mean, which are more acoustically similar to the prototype (Latinus, McAleer, Bestelmeyer, & Belin, 2013) (Figure 2).
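As a rough illustration of the distance-to-mean measure described above, the following minimal Python sketch computes each voice's distance to the average of a set of voices and then rank-correlates that distance with a response amplitude. Everything in it is an assumption made for illustration: the feature vectors and response values are made up, the Euclidean metric is one possible choice of distance, and the function names are hypothetical. It is not the stimulus set or analysis pipeline of Latinus et al. (2013).

```python
import math


def mean_vector(voices):
    """Average each feature across voices: a stand-in for the 'prototype'."""
    n = len(voices)
    return [sum(column) / n for column in zip(*voices)]


def distance_to_mean(voice, prototype):
    """Euclidean distance between one voice and the prototype (assumed metric)."""
    return math.sqrt(sum((v - p) ** 2 for v, p in zip(voice, prototype)))


def ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up data: four voices, each described by three standardized features.
voices = [
    [0.1, -0.2, 0.0],
    [1.5, 0.9, -1.2],
    [-0.4, 0.3, 0.2],
    [-1.2, -1.0, 1.0],
]
# Made-up response amplitudes, one per voice (arbitrary units).
responses = [0.2, 1.1, 0.3, 0.9]

prototype = mean_vector(voices)
distances = [distance_to_mean(v, prototype) for v in voices]
rho = spearman(distances, responses)
print(distances, rho)
```

In this toy data the responses were chosen to increase with distance to the mean, so the rank correlation is 1 up to floating-point rounding; real data would of course be noisier.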

Social perception
Another aspect of face and voice processing that features baffling similarities is the formation of social percepts and inferences: how we judge someone unknown as attractive, competent or untrustworthy based on a glance at their face or a word heard.
a. Attractiveness and averaging. It has long been established that averaging faces makes them more attractive: face composites generated by averaging faces from several different identities are judged more attractive than individual faces (Galton, 1878; Langlois & Roggman, 1990), although it is clear that there is more to attractiveness than mere averageness (DeBruine, Jones, Unger, Little, & Feinberg, 2007). Two main explanations have been proposed for this phenomenon: the "good genes" account, which proposes that we perceive averaged faces as more attractive because, if they were real faces, they would belong to individuals with high genetic fitness (with more symmetrical features and fewer imperfections), that is, good potential mates (Grammer, Fink, Moller, & Thornhill, 2003; Langlois & Roggman, 1990; Thornhill & Gangestad, 1999); and the "perceptual fluency" account, which holds that averaged faces are preferred because they are closer in face space (more perceptually similar) to a putative internal prototype and hence easier to process (Halberstadt & Rhodes, 2003; Winkielman, Halberstadt, Fazendeiro, & Catty, 2006). Both accounts predict that a similar phenomenon should be observed for voices: averaging the voices of several individuals should result in a more attractive voice. Thanks to the recent advent of voice morphing tools (Kawahara & Matsui, 2003), we were able to test this prediction, and indeed a steady increase in attractiveness ratings with the number of averaged voices in a composite was observed (Bruckert et al., 2010), similar to that found for faces (Figure 3). More than a curiosity, the phenomenon offers a window onto the perceptual mechanisms of voice attractiveness and reveals that two main, independent acoustical features are at play: the amount of spectro-temporal irregularities (measured for instance by the harmonics-to-noise ratio) and "distance-to-mean", where voices acoustically more similar to the average are generally found more attractive (Bruckert et al., 2010).
These two acoustical parameters are directly analogous to two features known to be key determinants of facial attractiveness: face texture smoothness and distance to mean.
[Figure 2, partial caption] Maps of Spearman correlation between beta estimates of BOLD signal in response to each voice stimulus and its distance-to-mean, overlaid on the TVA map (black). Colour scale indicates significant r values (p < .05 corrected for multiple comparisons). Note a bilateral distribution with a maximum along the right anterior STS. (C) Scatterplots and regression lines between estimates of BOLD signal and distance-to-mean at the peak voxel for "had" syllables. (D) Scatterplots and regression lines between estimates of BOLD signal and distance-to-mean at the peak voxel observed for "hellos". Reproduced (permission pending) from Latinus et al. (2013).
b. Social inferences in a same 2D "social space". Social face perception research has shown that people readily form personality impressions from unknown faces. These social inferences are formed rapidly (a mere second is needed to reach competence judgments that predict election margins; Todorov, Mandisodza, Goren, & Hall, 2005) and robustly: impressions may not be accurate, but different people tend to agree on them. The diverse personality impressions (competence, aggressiveness, friendliness) are well summarized by a 2D "social face space" with perceived Trustworthiness and Dominance as the two main axes (Oosterhof & Todorov, 2008). We recently found that the exact same conclusions can be reached for voices. From a single word, "hello", listeners were found to form personality impressions with high inter-rater agreement, and principal component analysis reveals that the diverse personality judgments are also well summarized in a 2D "social voice space" with main axes best corresponding to Trustworthiness and Dominance, exactly as for faces, a finding observed for both male and female voices (McAleer, Todorov, & Belin, 2014).
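To make concrete how a principal component analysis can reduce several trait ratings to a two-dimensional space, here is a minimal Python sketch. It is an illustration under stated assumptions, not the analysis of McAleer et al. (2014): the ratings are made up, the four trait labels are hypothetical, and the components are extracted by power iteration with deflation, chosen only because it needs nothing beyond the standard library.

```python
import math


def center(matrix):
    """Subtract each column mean (one column per rated trait)."""
    n = len(matrix)
    means = [sum(column) / n for column in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def covariance(centered):
    """Sample covariance matrix between traits."""
    n, k = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(k)] for i in range(k)]


def top_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    k = len(cov)
    start = [float(i + 1) for i in range(k)]  # arbitrary non-uniform start
    norm = math.sqrt(sum(x * x for x in start))
    vec = [x / norm for x in start]
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break
        vec = [x / norm for x in new]
    value = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(k))
                for i in range(k))
    return vec, value


def pca(matrix, n_components=2):
    """Return (loadings, share of total variance) for the first components."""
    cov = covariance(center(matrix))
    k = len(cov)
    total = sum(cov[i][i] for i in range(k))
    result = []
    for _ in range(n_components):
        vec, value = top_component(cov)
        result.append((vec, value / total))
        # Deflation: remove the component just found before seeking the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(k)]
               for i in range(k)]
    return result


# Made-up mean ratings for four voices on four hypothetical traits:
# trustworthiness, friendliness, dominance, aggressiveness.
ratings = [
    [7.0, 6.0, 6.5, 6.0],
    [7.0, 6.0, 3.5, 4.0],
    [3.0, 4.0, 6.5, 6.0],
    [3.0, 4.0, 3.5, 4.0],
]

components = pca(ratings)
for loadings, share in components:
    print([round(x, 3) for x in loadings], round(share, 3))
```

The toy ratings were built from two independent underlying dimensions, so two components account for all of the variance, with the first loading only on the first two traits and the second only on the last two. Which psychological label best fits each axis is an interpretive step that the code does not perform.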
The above sections have provided a short and deliberately partial, but I believe no less striking, account of the similarities in face and voice processing. These similarities, observed in several different domains (discrimination from other stimuli, processing of identity information, social inferences), are perhaps not so surprising, considering that the processing of these different types of information poses similar problems to the cortical architecture, problems that are apparently solved in similar ways. This similarity no doubt also facilitates the integration of information across modalities during naturalistic encounters (Campanella & Belin, 2007).

Disclosure Statement
No potential conflict of interest was reported by the author(s).