Viewing speech in action: speech articulation videos in the public domain that demonstrate the sounds of the International Phonetic Alphabet (IPA)

ABSTRACT In this article, we introduce recently released, publicly available resources that allow users to watch videos of hidden articulators (e.g. the tongue) during the production of various types of sounds found in the world’s languages. The articulation videos on these resources are linked to a clickable International Phonetic Alphabet chart (International Phonetic Association. 1999. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge: Cambridge University Press), so that the user can study the articulations of different types of speech sounds systematically. We discuss the utility of these resources for teaching the pronunciation of contrastive sounds in a foreign language that are absent in the learner’s native language.


Introduction
It is extremely difficult, if not impossible, to fully master the phonology of a non-native language, unless exposure to that language starts in early childhood (Seliger 1978). Yet, pronunciation is an aspect that many language learners aspire to master (Fraser 2010). Ironically, it is also an area that received little attention in the language classroom in the recent past, as well as an area that many teachers lack confidence in teaching (Foote 2015).
Pronunciation was a major concern of approaches such as the Reform Movement at the turn of the twentieth century, and the Audiolingual Method in the 1940s-1950s. The later marginalisation of pronunciation instruction can be ascribed to several factors. Notably, Communicative Language Teaching, which emphasised meaning over form, gained popularity in the 1970s. At around the same time, the efficacy of classroom pronunciation instruction was also questioned (e.g. Suter 1976 as cited in Ketabi and Saeb 2015). During the past decade, however, calls for pronunciation instruction have increased. This was preceded by a gradual spread of the view that 'intelligible pronunciation is an essential part of communicative competence' (Morley 1991, 488), and the emergence of evidence that pronunciation instruction is beneficial (Derwing and Munro 2005).
Along with the renewed interest in pronunciation instruction came new perspectives. Intelligibility and comprehensibility replaced native-like pronunciation as the official goal of the language classroom. The new goal is motivated by feasibility, practicality, and ideology, the latter two particularly for English, today's global lingua franca (Jenkins 2002; Jindapitak 2015; Thomson and Derwing 2015). In attaining intelligibility and comprehensibility, both individual sounds and suprasegmental features of speech (e.g. stress and intonation) are now considered important (Lee, Jang, and Plonsky 2014; Foote 2015; Ketabi and Saeb 2015). To what extent pronunciation instruction, which focuses on form, can be integrated into the speaking and listening components within a communicative framework remains to be seen (Isaacs 2009; Foote 2015).
Whatever the perspective, few would disagree that an inability to adequately produce contrastive sounds, or phonemes, in the target language (e.g. rice vs. lice) can hinder communication (Jenkins 2002). While articulatory instructions are effective in teaching the pronunciation of phonemes absent in the learner's native language (e.g. Catford and Pisoni 1970), traditional articulation-based descriptions of speech sounds may be daunting for language teachers as well as learners who are not accustomed to articulatory phonetics (Yule 1990).
In this article, we introduce recently released, web-based resources that would help such users grasp how different types of speech sounds are produced through videos of hidden articulators (e.g. the tongue) in action. These videos are linked to clickable International Phonetic Alphabet (IPA) charts (International Phonetic Association 1999), so that the users can study the articulation of various speech sounds systematically. The resources are accessible to anyone with Internet access, for free.
The rest of the article is organised as follows. Section 2 gives a technical overview of some data acquisition techniques used to capture hidden speech articulators (Section 2.1) and the IPA chart (Section 2.2). Section 3 introduces resources providing articulation videos linked to the IPA chart. Section 4 provides notes regarding the usage of these resources for improving pronunciation in a foreign language. The web-based resources can be accessed by clicking the URLs (Uniform Resource Locators) after the name of each resource, where it is discussed in the article text. The resources' URLs are also listed in Section 5. Section 6 provides a glossary of some phonological/phonetic terms used in this article; in the article text these terms are given in small capital letters on their first use.

Articulatory data acquisition techniques
When we speak, multiple articulators (e.g. VOCAL FOLDS, lips, VELUM, tongue, jaw) are delicately coordinated to produce strings of sounds that make up words, phrases, and sentences. Many of these articulators are difficult to observe during speech production. The X-ray technique was used in the past to study speech production, as X-rays make all articulators visible (Stone 2013). Though X-rays shed tremendous light on speech production, their use is now largely restricted to medical imaging because of increased awareness of associated health risks.
In place of X-rays, magnetic resonance imaging (MRI) is now widely used to study the VOCAL TRACT shape during speech production. MRI utilises radio signals produced by the hydrogen atoms in water and fat in our body, in response to a strong magnetic field created nearby. Figure 1(a) shows the midsagittal plane (profile view) of a female speaker's head, captured by MRI during her production of an ALVEOLAR PLOSIVE consonant [d], as in dine. As the figure shows, MRI characterises soft tissues well, and the overall vocal tract shape can be clearly seen, although MRI cannot easily image teeth, which contain little water or fat. The latter aspect is a drawback of MRI applications in speech research, as teeth play an important role when producing sounds like [f] and [θ], as in fin and thin. Furthermore, as the weak radio signals elicited during MRI scans need to be summed over time, it is difficult to obtain high-quality images at high frame rates, required to capture the rapid movements of some speech articulators, notably, the tongue. Other drawbacks include loud scanning noise, which interferes with audio recording (NessAiver et al. 2006).
Tongue movements during speech production are instead often studied using ultrasound tongue imaging (UTI). UTI capitalises on the fact that sound waves pass well through soft tissue but are reflected at tissue-air boundaries; real-time images of the upper surface of the tongue can thus be viewed during speech production by placing a probe that emits and receives ultrahigh-frequency sound waves beneath the speaker's chin. The strengths and weaknesses of UTI are almost complementary to those of MRI. As Figure 1(b) shows, UTI mainly shows the tongue surface only, and how the tongue surface relates to the rest of the vocal tract is unclear. Additionally, the tongue tip is often invisible in UTI data because of an air pocket underneath, or a shadow cast by the jaw bone. However, with UTI it is easier to achieve frame rates high enough to capture most tongue motions during speech production, with many modern ultrasound machines capable of scanning at around 90 fps (frames per second) (Stone 2013). Other advantages of UTI when compared with MRI include the relative ease with which high-quality audio recordings can be obtained during data acquisition, as well as relatively low cost and convenience.

The IPA chart
The IPA consists of a universal set of symbols representing distinctive sounds of the world's languages, and is used to show pronunciation in many dictionaries (International Phonetic Association 1999). The IPA chart consists of several sections such as vowels, PULMONIC consonants, and non-pulmonic consonants.
The IPA chart can be a useful tool for teaching the basics of speech production, as it shows at a glance commonalities and differences between the articulations of various speech sounds. The vowels section of the chart, shown in Figure 2, arranges vowels along three dimensions: tongue height, tongue backness, and lip rounding. Figure 3 gives a partial view of the pulmonic consonants section. Each column of the table represents a place of articulation from BILABIAL to GLOTTAL (from the leftmost to the rightmost column), and each row a manner of articulation (e.g. plosive, NASAL, TRILL). It is easy to see in the table, for example, that [d] and [n] (as in dine and nine) are produced at similar places of articulation but with different manners of articulation (see International Phonetic Association 1999 for details).
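The row-and-column logic of the pulmonic consonants table can be sketched as a small lookup structure. The following Python sketch is only an illustration: the class name `Consonant`, the helper functions, and the hand-picked subset of symbols are assumptions made for this example, and the single label "alveolar" simplifies the chart's layout for sounds such as [t], [d], and [n].

```python
from typing import NamedTuple


class Consonant(NamedTuple):
    place: str    # column of the chart (place of articulation)
    manner: str   # row of the chart (manner of articulation)
    voiced: bool  # whether the vocal folds vibrate


# Hand-picked, simplified subset of the pulmonic consonants table
# (illustrative only; not the full chart).
CONSONANTS = {
    "p": Consonant("bilabial", "plosive", False),
    "b": Consonant("bilabial", "plosive", True),
    "m": Consonant("bilabial", "nasal", True),
    "t": Consonant("alveolar", "plosive", False),
    "d": Consonant("alveolar", "plosive", True),
    "n": Consonant("alveolar", "nasal", True),
    "r": Consonant("alveolar", "trill", True),
    "ʔ": Consonant("glottal", "plosive", False),
}


def compare(a: str, b: str) -> dict:
    """Report which articulatory features two symbols share."""
    x, y = CONSONANTS[a], CONSONANTS[b]
    return {
        "same_place": x.place == y.place,
        "same_manner": x.manner == y.manner,
        "same_voicing": x.voiced == y.voiced,
    }


def column(place: str) -> list:
    """Return the symbols listed under one place of articulation."""
    return sorted(s for s, c in CONSONANTS.items() if c.place == place)


# [d] and [n] (as in dine and nine): same place, different manner.
print(compare("d", "n"))
print(column("bilabial"))
```

Comparing [d] with [n] in this sketch reports a shared place and a different manner, which is the relationship a learner can read directly off the chart.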

Articulation video resources
The 'IPA Charts' section of the Seeing Speech website (Lawson et al. 2015b) http://www.seeingspeech.ac.uk, hosted by the University of Glasgow, provides a clickable IPA chart, which allows the user to access the articulation videos of desired sounds by clicking sound symbols on the chart. The articulation videos on Seeing Speech come in three forms: MRI videos, UTI videos, and animated heads. The MRI videos feature a female phonetician and are synchronised with temporally matched, high-quality audio recordings of the same phonetician made in a quiet recording condition. The UTI videos, which feature the same female phonetician and a male phonetician, are played at original and then half speed.
The animated heads were created by combining the image of the entire vocal tract from the female phonetician's MRI data, with her tongue data from UTI and lip data from a video camera, both captured at better temporal and spatial resolutions than MRI. The resulting animations were sampled at 24 fps, a higher frame rate than the 6.5 fps of the MRI data. Figure 5(a)-(c) outlines the animation creation process. As Figure 5(c) shows, the animated head has the upper and lower teeth; the location of the biting edge of the teeth was estimated from an MRI recording of an INTERDENTAL articulation (where the incisors indent the tongue's surface). Individual videos on this website are also available on YouTube's ArticulatoryIPA channel https://www.youtube.com/playlist?list=UUuOKJqD00W2EiC3DHmOuu0g.
Another useful resource with a clickable IPA chart is rtMRI IPA Chart http://sail.usc.edu/span/rtmri_ipa/index.html, created by the Speech Production and Articulation kNowledge Group, University of Southern California. This website was recently updated to provide high-frame-rate MRI videos of two female and two male phoneticians, synchronised with noise-cancelled audio recordings made simultaneously with the MRI recordings. The videos are sampled at 83 fps, using a sliding window technique to reconstruct MRI scans at a much higher frame rate than the rate at which the data were acquired (Narayanan et al. 2004). Along with the sounds on the IPA chart, the website provides MRI videos showing the production of some English words, sentences, and passages by phoneticians from the US.
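The general idea behind a sliding window reconstruction can be illustrated with a toy calculation. The Python sketch below is not the reconstruction pipeline of Narayanan et al. (2004); the function names and all numbers (readouts per second, window length, step size) are hypothetical values chosen only to show why overlapping windows yield more frames per second than non-overlapping ones.

```python
def sliding_windows(n_readouts: int, window: int, step: int) -> list:
    """Index ranges (start, stop) of the readouts combined into each frame."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [(s, s + window) for s in range(0, n_readouts - window + 1, step)]


def frame_rate(readouts_per_second: float, readouts_advanced_per_frame: int) -> float:
    """Frames per second when each new frame advances by the given number of readouts."""
    return readouts_per_second / readouts_advanced_per_frame


# Hypothetical acquisition: 160 readouts per second, 16 readouts per full image.
READOUTS_PER_SECOND = 160.0
WINDOW = 16

# Without overlap, each frame uses a fresh set of 16 readouts.
non_overlapping = sliding_windows(64, WINDOW, WINDOW)
# With a sliding window, each frame reuses most readouts and advances by only 2.
overlapping = sliding_windows(64, WINDOW, 2)

print(len(non_overlapping), frame_rate(READOUTS_PER_SECOND, WINDOW))
print(len(overlapping), frame_rate(READOUTS_PER_SECOND, 2))
```

Each reconstructed frame in this sketch still draws on a full window of readouts, so the higher frame rate gives a smoother video without shortening the time span that any single frame covers.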
The articulation videos on Seeing Speech and rtMRI IPA Chart can be played back without a plugin (e.g. QuickTime), that is, they can be viewed on smartphones and tablets as well as computers. Accessibility on devices used daily by the learners is potentially important, as improvement in pronunciation typically does not occur overnight (Thomson and Derwing 2015).
Finally, though the platform is limited to iOS devices (e.g. iPad, iPhone), iPA Phonetics (Coey, Esling, and Moisik 2014) allows the user to view videos of the LARYNX linked to an expanded IPA chart, and study LARYNGEAL articulations for various types of speech sounds found in the world's languages (Esling, Moisik, and Coey 2015). The application is free on the Apple Store.

Notes on usage of the resources
These recently released, publicly available resources provide articulation videos, which we hope will be useful when teaching/learning the pronunciation of phonemes absent in the learner's native language. We consider below how these resources should be used.
First, as explained earlier, the IPA groups contrastive sounds of the world's languages into a set of universal phonetic categories. Thus, the articulation videos and associated audio recordings in the IPA-based resources broadly reflect different ways in which sounds represented by different symbols are produced independent of language; they necessarily do not tell us about more subtle differences in the realisation of any given sound category that exist between languages or language varieties (Ladefoged 1987). Consequently, these resources can be used to teach basic articulatory principles underlying the production of phonemes absent in the learner's native language, but not to teach cross-linguistic or cross-dialectal differences in the way a sound category is produced (see Note 1).
Second, their utility as audiovisual aids in formal articulatory instructions aside, how best to use these resources for drill-type pronunciation training is yet to be explored. There are some positive signs from studies reporting improved pronunciation of non-native phonemes through training with animated heads revealing the workings of internal speech articulators, and the learners' appreciation of such input (Massaro and Light 2003; Kröger et al. 2010; Dey 2012; Wang, Hueber, and Badin 2014). Nevertheless, it is surprisingly difficult to find studies unambiguously demonstrating additional benefits from visual presentation of hidden speech articulators, when compared with images of a model speaker's face (see Note 2). While this may be partially ascribed to the paucity of studies directly addressing the question, it is worth bearing in mind that many late learners bring to language learning well-developed skills to extract articulatory information from the speaker's face (Hazan et al. 2005), but not from visual presentations of internal articulators.
The following summarises the findings of studies on our ability, in our native language, to extract articulatory information from visual displays of complete vs. cutaway heads presented with degraded acoustic information (Grauwinkel, Dewitt, and Fagel 2007; Wik and Engwall 2008; Badin et al. 2010). First, we are generally better at extracting articulatory information from videos of complete heads than cutaway heads. With training, many of us can learn to use articulatory information conveyed by cutaway heads, but individual differences in this ability are large. Second, learning to interpret a cutaway head is easier when the visual information is presented alone than when the visual and acoustic information are presented bimodally.
Thus, familiarisation with videos of hidden articulators is perhaps required before learners can fully benefit from the videos, with some learners requiring more time and training (hence the potential importance of accessibility of these resources on smartphones and tablets). Furthermore, familiarisation and/or training may be more effective if carried out in stages, for example, by first presenting the videos without acoustic information and later with acoustic information (as in, for example, Bernhardt et al. 2008; see Note 3). These recommendations are only tentative, as they are based on an assumption that native language speech perception in noise can be compared to the situation faced by late language learners, who cannot hear all the sound contrasts in the target language. We welcome future research on the effective use of videos of hidden speech articulators in pronunciation training.

URLs of articulatory resources
Seeing Speech: http://www.seeingspeech.ac.uk
ArticulatoryIPA channel (YouTube): https://www.youtube.com/playlist?list=UUuOKJqD00W2EiC3DHmOuu0g
rtMRI IPA Chart: http://sail.usc.edu/span/rtmri_ipa/index.html

Glossary of phonological/phonetic terms
ALVEOLAR: (A sound) associated with the placement of the tongue against/near the HARD PALATE, just behind the teeth.
BILABIAL: (A sound) produced using the upper and lower lips.
GLOTTAL: (A sound) produced with the glottis, that is, the opening between the VOCAL FOLDS.
HARD PALATE: The bony, front part of the roof of the mouth.
INTERDENTAL: (A sound) associated with the placement of the tongue tip between the upper and lower front teeth.
LARYNGEAL: Relating to the LARYNX.
LARYNX: A complex organ, which houses the VOCAL FOLDS. It sits in the neck, below the tongue root and above the windpipe.
NASAL: (A sound) produced with passage of air through the nose.
PLOSIVE: (A sound) produced by complete closure of the oral passage followed by a sudden release of higher pressure air.
PULMONIC: (A sound) produced using the air from the lungs.
TRILL: (A sound) produced by a rapid vibration of an articulator against another.
VELUM: The soft, back part of the roof of the mouth.
VOCAL FOLDS: Two folds of tissue in the LARYNX. They vibrate to produce voiced sounds.
VOCAL TRACT: The branching passage between the LARYNX and the mouth and nose.

Notes
1. Readers interested in articulation videos of words and phrases produced with various regional accents of English are referred to the Dynamic Dialects website (Lawson et al. 2015a) http://www.dynamicdialects.ac.uk.
2. This should not be confused with reports of positive training outcomes using visual biofeedback provided by, for example, electropalatography or UTI for pronunciation training in a foreign language (Schmidt 2012; Wu et al. 2015) and speech therapy (Gibbon et al. 1996; Bernhardt et al. 2008).
3. Relatedly, the benefits of visual presentation of internal articulators appear dramatic in speech production training of people with hearing loss (Massaro 2003).