Infant perceptual development for faces and spoken words: an integrated approach.

Tamara L Watson, Rachel A Robbins, Catherine T Best.

Abstract

There are obvious differences between recognizing faces and recognizing spoken words or phonemes that might suggest development of each capability requires different skills. Recognizing faces and perceiving spoken language, however, are in key senses extremely similar endeavors. Both perceptual processes are based on richly variable, yet highly structured input from which the perceiver needs to extract categorically meaningful information. This similarity could be reflected in the perceptual narrowing that occurs within the first year of life in both domains. We take the position that the perceptual and neurocognitive processes by which face and speech recognition develop are based on a set of common principles. One common principle is the importance of systematic variability in the input as a source of information rather than noise. Experience of this variability leads to perceptual tuning to the critical properties that define individual faces or spoken words versus their membership in larger groupings of people and their language communities. We argue that parallels can be drawn directly between the principles responsible for the development of face and spoken language perception.

Keywords: face recognition; face space; infant perceptual development; perceptual assimilation; perceptual narrowing; speech perception; spoken word recognition

Year: 2014 PMID: 25132626 PMCID: PMC4231232 DOI: 10.1002/dev.21243

Source DB: PubMed Journal: Dev Psychobiol ISSN: 0012-1630 Impact factor: 3.038

Language has long been held to be a defining ability that distinguishes humans from other animals. That is, it has been considered to be species-specific (e.g., Chomsky, 2006; Deacon, 1997). Relatedly, language acquisition has often been assumed to require an elaborated, specialized neural module that is uniquely devoted to language and thus divorced from more general cognitive skills (Coltheart, 1999; Fodor, 1983). The acquisition mechanisms have been thought to be domain-specific. Yet language is not the sole focus of such claims for biological specialization of our perceptual and cognitive skills. Human face recognition is another capability that has also been posited to be specialized (domain-specific) and species-specific (e.g., de Schonen & Mathivet, 1989; Morton & Johnson, 1991). More recent research, however, has shown that both abilities undergo substantial “tuning” by environment-specific experience during the first year of life. Specifically, as infants develop they tend to show both a “narrowing” of perceptual ability away from discrimination between less experienced stimuli, and an “elaboration” or increase in discrimination and categorization ability for often-experienced stimuli. This experience-based perceptual tuning poses some challenge to claims that language and face recognition are biological specializations. Moreover, certain language-like abilities (Gervain & Mehler, 2010), as well as the ability to recognize individual human faces (Peirce, Leigh, daCosta, & Kendrick, 2001), have been demonstrated in other animals, leading some theorists to question species-specificity for both abilities.
There has also been increasing experimental evidence on the development of spoken language perception and face recognition in infancy that indicates the processes involved may not be entirely domain-specific (e.g., see Bahrick & Lickliter, 2012; Bulf, Johnson, & Valenza, 2011; Pascalis, de Haan, & Nelson, 2002; Scott & Monesson, 2009; Scott, Pascalis, & Nelson, 2007; Walker-Andrews, 1997). Importantly for the present article, this evidence also suggests fundamental parallels in the ways the two skills emerge in infants. The purpose of this article, therefore, is to examine the parallels in the development of spoken word and face perception in infancy and outline a proposal of their theoretical implications. Many things that humans can do outwardly appear to involve quite separate and qualitatively different skills, such as language, face recognition, music-making, dancing, mathematical calculation, etc. Yet similar developmental trajectories for any seemingly distinct pair of abilities could offer a clue that both skills may be underwritten by a common fundamental set of mechanisms deployed to subserve disparate functions. As the link between perceptual behavior and the sensorimotor and cognitive functions of the brain is increasingly revealed by research on infant development, we envisage that these “fundamental mechanisms” could be functioning at many different levels of resolution spanning neuroscience and psychology. For example, the means by which neural connectivity in sensory cortex is shaped based on experienced stimulation during infancy could be common across sensory domains. Despite the experienced input differing across the sensory modalities, any common constraint on the development of neural connectivity patterns would result in the development of these sensory skills sharing important characteristics that would be apparent in the developmental trajectory of both skills.
Based on the integrative review of the development of face and spoken word perception skills that we present here, we propose that such a fundamental set of mechanisms underpins both of these abilities and is evident in the perceptual behavior of developing infants. We posit that these mechanisms/principles organize incoming sensory information into meaningful domain-relevant categories, such as the phonemes (consonants and vowels) of words in native-language speech, or the faces of individuals and subgroups within our social circle. Additionally, we will outline the importance of systematically structured variability in the natural environmental input for the infant's acquisition of these perceptual abilities. We propose that utilizing such natural variability might lead to similar outcomes in terms of perceptual spaces, or internal representations, for speech and faces. Importantly, we posit that these variability-based perceptual spaces are organized around the critical dimensions of variation that perceivers discover across the variable surface details of the speech and faces they experience in their environment (see, e.g., Best, in press; see also, E. J. Gibson, 1969; J. J. Gibson & E. J. Gibson, 1955). Not only should these perceptual spaces be based around the dimensions of variation, we propose that the perceptual space should be considered to be composed primarily of these dimensions, rather than being composed of a suite of exemplars or even of norms that specify the central tendency of each dimension.

WHAT KIND OF LEARNING IS INVOLVED?

The type of developmental learning we are talking about is fine grained, the foundation for skilled discrimination between and categorization of individual tokens (e.g., spoken words or individual faces) that exist within a crowded environmental space. The environmental space can be considered crowded if the inputs or signals needing individuation share a high level of similarity across multiple dimensions of structured variability. For example, faces vary within certain constraints along several visible dimensions, yet all share a common configuration with two eyes arranged above a nose that is above a mouth, in which small differences in, for example, spacing between the eyes, can change their appearance dramatically. In spoken language, multi-dimensional variability is also ubiquitous. Very small physical differences in production of the consonants or vowels of a word (i.e., phonetic differences) can signal large differences in meaning. For example, the single phonetic difference between the words PARK and BARK is that the vocal cords start vibrating to produce voicing a few tens of milliseconds later in P than in B. From within this type of crowded environmental space one person or word often also needs to be discriminated/recognized across a wide range of transformations or systematic variations. For example, the identity of a person needs to be recognized despite a change in facial expression, lighting and pose relative to the viewer. The recognition of individual words needs to be maintained across differences in the speech styles and emotional expression of the voice of an individual speaker, and across speakers including those who speak with different accents. Moreover, the same visual or auditory input may need to be used in a variety of ways. For example, the sex of the face or of the speaker may also need to be established across all the variations previously outlined, as may race, age etc. 
In short, a range of categorizations can and must be made from the same object, involving many variations in appearance and/or sound. The distinctions that need to be made about the environmental input can be pictured as being supported by a “perceptual space” that describes this information within the perceptual/cognitive system of the perceiver. In the case of spoken word recognition and face recognition we can call these internal perceptual spaces the perceiver's word space and face space (Valentine, 1991), respectively. To achieve an extremely flexible perceptual proficiency, many different aspects of the systematic variability between experienced instances in the environment will need to be characterized within the perceptual space. Very different aspects of the incoming information signal that the same word has been spoken by a male compared to a female, relative to those signaling that the same word has been spoken by two females with different accents. Likewise, the cues signaling that two people share the same facial expression will be different from those that signal that a person is from a different race than the perceiver is. While other kinds of perceptual activities can also require skilled perception (e.g., musical abilities), we focus on the acquisition of face recognition and spoken word and phoneme perception skills. They are both obligatory to developing and maintaining social relationships, and thus are central to us as humans.

COMMON DEVELOPMENTAL PROGRESS ACROSS SENSORY DOMAINS

Researchers often acknowledge that spoken language and face recognition are comparable in that perceptual narrowing occurs in both areas (e.g., Lewkowicz & Ghazanfar, 2009; Pascalis et al., 2002; Scott et al., 2007). Perceptual narrowing is the observation that young infants have the ability to sense a wide variety of stimuli, but these abilities become selectively narrowed as a result of exposure to the specific patterns of stimulation in their environment. Thus perceptual narrowing refers, on the one hand, to developmental improvement in perception of often-experienced stimuli, reflecting the strengthening of neural pathways that are consistently stimulated. Perceptual skills are believed to decline, on the other hand, for stimuli the individual is not exposed to, as a result of unused/unstimulated neural pathways becoming less efficient through processes such as synaptic pruning. The similarity in developmental trajectory in both the speech/spoken word perception and the face perception domains implies to us that a common principle or set of principles is responsible for the development of these and possibly other skilled categorization and discrimination abilities (music perception, for example). We will first consider separately the development of spoken word and phoneme perception, and face perception. We will then outline the development of audio-visual capabilities across face and speech perception. What is apparent in the first year of life is that infants show incredible initial sensory acuity, and that their perceptual skills become tuned by the specific sensory environment they experience and we argue, as well, by the presence of structured variability within that environment.

DEVELOPMENT OF SPOKEN WORD AND PHONEME PERCEPTION

Here we present an overview of development of speech perception and spoken word recognition ability. Newborns (birth up to 2 months) have a preference for normal speech compared to speech played backwards, filtered or computer-modified, for example, sinewave speech (Dehaene-Lambertz, Dehaene, & Hertz-Pannier, 2002; Vouloumanos & Werker, 2007a,b), but do not yet prefer speech over animal vocalizations or natural environmental sounds (Shultz & Vouloumanos, 2010; Vouloumanos, Hauser, Werker, & Martin, 2010). Within the speech domain, however, newborns do already prefer infant directed speech (IDS) over adult directed speech (ADS) (Cooper & Aslin, 1990). IDS is found to contain a wider range of variation than ADS along a number of dimensions. This increased yet systematic variation is important to our proposed theoretical framework, and we discuss it in detail below in the Variability in Infant Directed Spoken Interactions Section. Importantly for insights about perceptual narrowing/attunement, newborns also show a preference for their mother's language (Byers-Heinlein, Burns, & Werker, 2010; Mehler et al., 1988; Moon, Cooper, & Fifer, 1993), and their mother's voice (Mehler, Bertoncini, Barrière, & Jassik-Gerschenfeld, 1978). While the preference for normal rather than distorted speech could reflect biological and/or experience-based influences, their very early preferences for maternal language(s) and voice strongly suggest some influence from in utero auditory experience, which is compatible with the fact that the fetal auditory system is functional during at least the final prenatal trimester (Lickliter, 1993). In keeping with these more global preferences, newborns are able to make a range of finer speech discriminations that appear to set them up for learning about their spoken language environment.
Newborns (as we have defined them, i.e., birth to 2 months) are able to discriminate most consonant and vowel contrasts found across the languages of the world (e.g., see reviews by Aslin & Pisoni, 1980; Werker, 1989). They can also detect acoustic cues to word boundaries (Christophe, Dupoux, Bertoncini, & Mehler, 1994) and discriminate words that differ in lexical stress (Sansavini, Bertoncini, & Giovanelli, 1997). They appear to be poised to make the most of the varied input they are exposed to, and to learn most from their primary caregiver, including even prenatal experience with mother's voice and language(s). At this stage of development the newborn can be considered optimally attuned for experiencing the complex variations in speech input, with which their perceptual space is molded to their native language environment. Between 2 and 6 months of age infants show perceptual patterns and preferences in the speech domain, some of which indicate further attunement to native speech properties. Whereas newborns' preference for natural over artificial audio stimuli is quite broad, extending from speech to rhesus monkey calls and other natural non-speech environmental sounds, these listening preferences narrow down to human speech by 3 months (Shultz & Vouloumanos, 2010; Vouloumanos et al., 2010). Additionally, 4-month-olds, like newborns, can discriminate between spoken passages of languages from different rhythmical classes (e.g., English, a stress-timed language, versus Japanese, a mora-timed pitch accent language, or French, a syllable-timed language that lacks stress contrasts) (Nazzi & Ramus, 2003), but by 5 months of age infants have been shown to discriminate languages from within the same rhythmical class (e.g., English and Dutch, both stress-timed languages) (Nazzi, Jusczyk, & Johnson, 2000) if one of the languages is familiar.
Despite showing some tuning to the global prosodic properties of connected speech, however, infants in this age range do not yet demonstrate tuning to the native consonant or vowel contrasts of their own language environment as they appear to still be able to discriminate most consonant and vowel contrasts they have been tested on, whether used in their native language or only in languages they have not experienced (Aslin, Pisoni, Hennessy, & Perey, 1981; Eilers, Gavin, & Wilson, 1979; Eimas, Siqueland, Jusczyk, & Vigorito, 1971; Trehub, 1976). There are some key exceptions to this pattern that are interesting. At 4 months of age discrimination is poor for certain native consonant contrasts, such as English /d/-/ð/ (as in doze-those) (Polka, Colantonio, & Sundara, 2001; see also Narayan, Werker, & Beddor, 2010, for evidence of early difficulties in discrimination of native nasal consonant contrasts). Conversely, there is also evidence of differences in perception of Kikuyu (Kenya) stop consonant voicing contrasts by native- versus non-native-learning infants (English-learning) at 2 months (Streeter, 1976). These latter findings suggest that not all consonant contrasts are created equal, with some apparently being influenced earlier by experience and others requiring more extended experience. These differences could pose some difficulties for a simple view of perceptual narrowing. Of particular interest for our hypothesis regarding the importance of experiencing natural systematic variation, 4- to 5-month-olds do show perceptual constancy for native vowel categories and contrasts, specifically recognizing the same vowel when it is spoken by both adult and child speakers of either gender (e.g., Kuhl, 1979, 1983). Between 6 and 9 months of age additional perceptual narrowing to the finer-grained phonemic categories of native speech occurs. 
By 6–8 months, infants' discrimination of non-native vowel contrasts is declining (e.g., Polka & Bohn, 1996, 2003; Polka & Werker, 1994), but there is no evidence of a decline in discrimination of non-native consonants until several months later (for a review, see Werker & Tees, 2005; also section on 9–12 months, below). Moreover, by 6 months infants show within-category perceptual differentiation of good versus poor exemplars of native vowels but show no evidence of doing so for non-native vowels (Kuhl et al., 1992). At this same age, infants are also able to segment words from continuous speech, and show a preference for content words as compared to function words (Shi & Werker, 2001), even though specific function words (e.g., THE) occur much more frequently than specific content words (e.g., DOG). This preference for content words may be taken to reflect the infant's apparent disposition at this age for engaging with stimuli from the informationally richest rather than the statistically most frequent categories experienced. Although specific content words are much less numerous in speech than specific function words they tend to be longer, more variable and arguably provide the core meaning of a sentence. By 7–8 months, infants have also developed the ability to recognize familiarized words across a change in amplitude (loudness), but have not been found to generalize this recognition across changes to fundamental frequency, speaker, gender, or affect (Singh, Morgan, & White, 2004; Singh, White, & Morgan, 2008), unless the words were previously highly familiar to them (e.g., MOMMY and DADDY) (Singh, Nestor, & Bortfeld, 2008). 
Between 9 and 12 months of age, discrimination of many non-native consonant contrasts shows a dramatic decline (e.g., Best, McRoberts, LaFleur, & Silver-Isenstadt, 1995; Best & McRoberts, 2003; Werker & Lalonde, 1988; Werker & Tees, 1984; Werker, Yeung, & Yoshida, 2012; Yoshida, Pons, Maye, & Werker, 2010; see reviews by Best, 1994; Werker & Tees, 2005) and discrimination of many non-native vowel contrasts has declined further from the levels seen at 6–9 months (e.g., Polka & Werker, 1994). There are some interesting and informative exceptions to this pattern, however. Discrimination of some non-native consonant contrasts remains good even past 12 months of age despite considerable perceptual narrowing for other contrasts at this age (e.g., for English-learning infants, Tigrinya dental vs. bilabial ejective consonants: Best & McRoberts, 2003; Zulu dental vs. lateral click consonants: Best, McRoberts, & Sithole, 1988; and Nu Chah Nulth velar versus uvular vs. pharyngeal fricatives: Tyler, Best, Goldstein, & Antoniou, 2014). These results for non-native consonants suggest that discrimination is maintained over the 10–12 month period only if the articulator or feature distinctions are used in native consonant contrasts, or if the articulatory properties of the non-native consonants are so highly discrepant from native consonants that adults perceive them as non-speech sounds (outside the native phonological system altogether). There are, conversely, certain native contrasts that are difficult for younger infants to discriminate, and some of these continue to be poorly discriminated until as late as 4 years of age (such as /d/ vs. voiced TH as in there) (Polka et al., 2001; Sundara, Polka, & Genesee, 2006; see also Cristià, McGuire, Seidl, & Francis, 2011). However, there is also evidence of perceptual elaboration, or improved discrimination for certain other native contrasts (such as English /r/ vs. /l/), discrimination of which shows a decline by this age in infants whose native language does not use these contrasts, for example, Japanese (Kuhl et al., 2006). Concerning development of perceptual constancy, by 9 months infants can recognize words across discrepancies not only in amplitude but also in fundamental frequency from previously unfamiliar words and non-words they have been familiarized with in the laboratory (Singh, White, et al., 2008). By 10.5 months their ability to recognize such newly familiarized words extends as well across a change in the emotional expression of the speaker, to new speakers and to differences in speaker gender (Singh et al., 2004; Singh, Nestor, et al., 2008). By 9–12 months, then, infants' perceptual word space is a developing model of both the phonemes and the spoken words of the language environment they will operate within, as opposed to an open model of all possible spoken languages. Not only is perceptual narrowing occurring, however, a strategic perceptual elaboration is also taking place. Recognizing a word across speakers or affects reflects a skill that results from experience not just of spoken language itself but also from experience with other information in the input such as contextual clues (e.g., facial expression), and dynamic feedback between the infant and the speaker. In this sense the learning, while still being informed by or capturing the variability in the signal, is no longer purely statistical. The kind of perceptual invariance that is emerging is the beginning of being able to abstract away from pure, surface-level environmental statistics toward deriving more abstract rules that will support the formation of categories that include any number of quite dissimilar forms of spoken words.
Although perception of vowel and consonant contrasts has become largely tuned to the native phoneme inventory by 9–12 months, and the ability to segment and recognize familiarized words from connected speech has become fairly robust to variations in speaker, gender, emotion and other superficial speech properties, word learning and word recognition abilities are still not adult-like at the end of the first year. Eleven- to 12-month-olds prefer listening to sets of words that are well-known to children of this age-group, as compared to listening to unfamiliar, low-frequency adult words they have never before heard. However, unlike in adults, this familiar-word preference extends broadly to mispronunciations of those words: infants appear to accept non-words that differ by a single consonant from words that they know as viable variants of the known words. For example, *VABY and *GAIRE are preferred as much as BABY and BEAR (e.g., Hallé & de Boysson-Bardies, 1994, 1996; Mulak & Best, 2013). In short, their perception of words still has further important refinements to undergo (not surprisingly). Beyond 12 months, in the first half of the second year, there is further perceptual attunement in children's learning and recognition of spoken words, both in terms of recognizing phonemic contrasts that distinguish words and in terms of constancy in recognizing words across variations that do not change word identity (see Best, in press; Best, Tyler, Gooding, Orlando, & Quann, 2009; Mulak & Best, 2013).
At 14–15 months, several studies suggest that children's ability to distinguish between newly learned words and single-consonant changes to those words is still fairly tenuous (Stager & Werker, 1997; Swingley & Aslin, 2000; Yoshida, Fennell, Swingley, & Werker, 2009), but is somewhat improved over the 11- to 12-month-old's in that they can recognize a change and reject a mispronunciation if the word is either previously very familiar to them, or the task demands are minimized while contextual support for word recognition is optimized (Fennell & Waxman, 2010; Fennell & Werker, 2003). By 18–19 months performance on such word-recognition and word-learning tasks, however, is much more robust and reliable (Swingley, 2003, 2007). In addition, somewhere between 15 and 19 months of age toddlers seem to move from being able to identify familiar words only when spoken in their native accent to also being able to identify the words when spoken in an unfamiliar accent (Best et al., 2009; Mulak, Best, Tyler, Kitamura, & Irwin, 2013). It seems that they become able to abstract their experiences to assess whether a word spoken in an accent they may never have encountered can be related to the phonological form of words they have experienced in their own native accent. As regional accents can change the phonetic details of words dramatically, this is a sophisticated form of perceptual constancy where there may be multiple interpretations of the input. The word the infant is listening to in another accent cannot fit any stored exemplar or be represented by any existing experience-based prototype. This is because the unfamiliar accent has previously not been encountered and it changes the low level phonetic detail of the word substantially. Thus, this kind of perceptual constancy is impossible to describe within a system that represents its input by extracting normalized prototypes or even by encoding an extensive list of experienced exemplars.
This kind of constancy would seem more to be supported by coming to recognize that words occupy malleable and dynamic regions along multiple dimensions in a perceptual word space and that in any given situation some of these dimensions will be more important than others in identifying the word. This type of phonological constancy for recognizing words must arise from discovering recurring multidimensional patterns within the structured variability of the language spoken in the child's environment.

DEVELOPMENT OF FACE PERCEPTION

Here we present an overview of development of face recognition ability. Newborns (birth up to 2 months) also show surprising visual capabilities that suggest the visual system at birth has biases to attend to basic structural properties of faces and hence to experiences that will allow important distinctions to be made. Newborns show a preference for schematic faces compared to scrambled versions of the same stimulus (Johnson, Dziurawiec, Ellis, & Morton, 1991) that is likely driven by a preference for images that share similar stimulus energy to faces (Kleiner & Banks, 1987) and additionally a similar structure (Kleiner, 1987). Given a stimulus composed of the parts of a face newborns will look preferentially at stimuli that are top heavy (Cassia, Turati, & Simion, 2004) and also at the spatial arrangements of basic shapes that most closely convey a face-like appearance (Cassia, Valenza, Simion, & Leo, 2008). When presented with two different photos of the same face newborns will also preferentially look at the version of the image containing the face that is gazing directly at them (Farroni, Csibra, Simion, & Johnson, 2002). They are also able to discriminate between images of faces on the basis of both external and internal facial features when each is presented in isolation. When hairline is kept constant, newborns who have some experience with faces show a preference for faces that adults rate as attractive (Slater et al., 1998). Mathematically averaged faces are consistently rated as more attractive than the individuals contained within the average (Langlois & Roggman, 1990). However, when the hairline is visible, it appears that it is the preferred cue and precludes processing of the internal features (Turati, Macchi Cassia, Simion, & Leo, 2006). Despite demonstrating sophisticated perceptual abilities, newborns' apparent favoring of the hairline is in line with their lower acuity than older children and adults. 
It has been suggested that newborns rely on spatial frequencies around 0.5 cycles per degree of visual angle when recognizing static faces (de Heering et al., 2008), making the hairline a very salient visual cue and highlighting that the developmental progress of the visual system itself is an important factor in determining the information a newborn can extract from a face. Intriguingly, face recognition also shows early signs of perceptual constancy. Newborns only 1–3 days of age are able to match faces across a rotation of 45°: between a full-face and a 3/4 view (Turati, Bulf, & Simion, 2008). While it is not yet established what cues newborns use to carry out this task it is a skill that is clearly important for forming meaningful perceptual categories about face identity. Related to this kind of perceptual constancy, newborns' ability to recognize a face is enhanced by viewing the face undergoing smooth rigid head motion in the form of a left/right rotation, compared to viewing the same series of video frames presented out of order (Bulf & Turati, 2010). Despite this, it seems that not all motion is beneficial in this way. Neither the rigid (Guellaï, Coulon, & Streri, 2011) nor the non-rigid motion of a speaking face shown without sound is sufficient to promote recognition of a new face at this age (Coulon, Guellaï, & Streri, 2011), possibly due to the complexity of the motion used in these studies compared to the rotational motion in the Bulf and Turati (2010) study. The addition of speech in concert with a moving face, however, appears to provide a newborn with the required information to look preferentially at their mother the first time they see her face in person (Sai, 2005) and even to recognize a stranger presented in a photo after audio-visual familiarization (Coulon et al., 2011).
Many of the perceptual capabilities discussed at this age need not be strongly face specific, in that they could reflect basic biases toward key aspects of any visual stimulus that, when found in combination as in a face, make such stimuli very salient for newborns. From 2 months, however, infants begin to show face-specific perceptual effects. Between 2 and 6 months of age, infants demonstrate quickly changing perceptual effects from their experiences with faces in their environment. The abilities of infants around 2 and 3 months suggest their perceptual space has begun to reflect a foundational structure based on experience. This nevertheless remains extremely sensitive to variations between faces that adults, conversely, show difficulty in discerning. At around 2 months of age infants begin to show an eye movement scanning preference for the eye region of a face (Hainline, 1978; Haith, Bergman, & Moore, 1977). They have also been shown to prefer scrambled stimuli that retain the phase spectrum of natural faces (the phase relations among the spatial frequencies in the image are preserved while their amplitudes are altered) and therefore look face-like to adults (Kleiner & Banks, 1987). Infants at 3 months are able to discriminate equally well between faces of their own race and also other races (Kelly et al., 2009; Kelly, Quinn, et al., 2007). At the same age, however, infants have developed preferences for faces similar to those in their most frequently encountered groups (see Sugden, Mohamed-Ali, & Moulson, 2014 for an analysis of an infant's most frequently encountered faces). They have a preference for faces of their primary caregivers' race (Kelly, Liu, et al., 2007; Kelly et al., 2005), and upright (but not inverted) faces that are the same sex as their primary caregiver, whether male or female (Quinn, Yahr, Kuhn, Slater, & Pascalis, 2002), as the primary caregiver is an infant's most viewed face (Bushnell, 2001; Sugden et al., 2014).
Even more specifically, infants with female caregivers show a preference for female same-race faces (Quinn et al., 2008). Interestingly, at 3 months infants' ability to discriminate between faces that are the same sex as their primary caregiver is clear, but there is debate as to whether they can discriminate between individuals of the other sex. Thus, the statistics of the environment may indeed be playing an important role at this age, inviting the question of what role a small but significant exposure to a face category plays (Quinn et al., 2002). This suggests, however, that infants at 3 months retain a flexible perceptual face space that has begun to acquire the statistics of their environment. Also highlighting the importance of the statistics of the environment and in particular the importance of sufficient variability in learning to categorize faces, 3-month-old Caucasian infants do not show evidence of a novelty preference to a new Asian face after habituating to a single Asian face but do show a novelty preference to a new face after habituating to just three Asian identities (Sangrigoli & De Schonen, 2004). That is, at least modest variation among individual faces during the familiarization phase may foster non-Asian infants' ability to show significant discrimination among Asian individuals' faces. Relatedly, at 3 months of age infants have also been shown to extract the commonality between sets of faces (de Haan, Johnson, Maurer, & Perrett, 2001; Rubenstein, Kalakanis, & Langlois, 1999). This effect is termed prototype extraction, where the average of a set of faces is responded to as being equally familiar as any of the experienced exemplars (Rosch, 1978; Rosch, Simpson, & Miller, 1976). Although findings like this are often interpreted to imply that face space is based around these prototypes, our interest in such results is that they also show that infants are sensitive to the statistical structure of their environment. 
Despite the beginnings of sophisticated face recognition skills, infants at 3 months are not yet showing all the hallmarks of adult face recognition. In particular, infants' categorical boundaries between faces are as yet quite fuzzy. When 4-month-old infants' perception of the identity of morphed faces was tested, infants treated a morph containing up to 70% of a face they had never seen before as though it were a familiarized face. That is, only 30% of a previously seen familiarized face was required in the morph for the face to be treated as familiar (Humphreys & Johnson, 2007). Between 6 and 9 months of age a perceptual narrowing to categories most often experienced seems to occur for faces, as is observed in the spoken word research. At 6 months, infants still show a novelty preference for previously inexperienced, unfamiliar individuals of a race other than their own (as long as it is not too dissimilar: Kelly, Liu, et al., 2007) and even monkey faces (Pascalis et al., 2002). This suggests that as yet they can still differentiate individual members of face-type categories with which they have little or no experience. At 6 months of age, infants have also been shown to maintain a spontaneous preference for attractive faces (Rubenstein et al., 1999). These findings have typically been interpreted as evidence of prototype extraction. Indeed, when infants are habituated to three equally attractive individuals they show no recovery of habituation when presented with an average of the three faces but will preferentially look at a novel face (Rubenstein et al., 1999). An important observation from this study is that infants will maintain habituation to a prototype face and will actively respond to new faces, demonstrating not only the ability to form a prototype but also the proclivity to explore variability away from it, rather than to rehearse that prototype. Similarly, Heron-Delaney et al. 
(2011) showed that non-Asian children between 6 and 9 months old and growing up in Australia only needed 1 hr of exposure to individuated Chinese faces for apparent maintenance of the discrimination of other-race faces. At 7 months of age, infants' response to morph stimuli containing mixes of two faces is showing signs of becoming more sharply tuned. As noted above, 4-month-olds responded to a 70% new face 30% familiarized face mixture as though it were a familiar face. In contrast, 7-month-olds treat a mixture as a familiarized face only when it contains up to 50% of the new face (Humphreys & Johnson, 2007).
Between 9 and 12 months. In contrast to infants at 6 months, by 9 months of age infants show a reduced novelty preference for previously seen but untrained individual monkey faces (Pascalis et al., 2002) and seem to only differentiate individual humans of their own race (Kelly et al., 2009; Kelly, Quinn, et al., 2007) unless provided with experience of faces from another race (Anzures et al., 2012). Additionally, by 9 months old infants have been shown to demonstrate integration between internal and external facial features only when presented with an upright but not an inverted own-race face (Ferguson, Kulkofsky, Cashon, & Casasola, 2009). They also demonstrate the ability to form categories according to race but to discriminate only among individuals categorized as own-race (Anzures, Quinn, Pascalis, Slater, & Lee, 2010). This indicates tuning based on the available input and the establishment of the basic foundational structure of the perceptual face space that is similar to that found in adult studies. One question is whether the apparent perceptual narrowing represents a time governed or experience governed developmental window (Maurer & Werker, 2014). In consideration of this, the apparent perceptual narrowing can be reversed with meaningful exposure to categories of faces not seen in the environment. For example, Pascalis et al. 
(2005) showed that training with a small set of individually named macaque faces was sufficient to prevent the loss of discrimination ability for macaque faces seen at 9 months. The progression of this apparently reversible perceptual narrowing is not yet sufficiently mapped out to understand concretely the process and timeline of perceptual narrowing involving the range of facial judgments considered here (see Maurer & Werker, 2014, for a review). However, it is apparent that judgments involving rarely-to-never experienced categories become more difficult with age in infancy. Just as has been found in speech perception, we anticipate that the perceptual narrowing in face recognition is accompanied by a concomitant perceptual elaboration that can support additional and more advanced perceptual constancies. This elaboration will be guided not only by better understanding of the statistics of the environment but also by cognitively abstracted categories reflecting social and cultural factors that provide feedback about socially relevant categorizations. This is an area that is as yet under-explored, given that studies of face recognition mostly do not use multiple images of the same face to probe whether infants can recognize constancy of a given face across various transformations (i.e., across different emotional expressions, lighting conditions, dynamic changes over time, etc.). 
Beyond 12 months, although a direct analogue of the kind of perceptual constancy measured in the spoken language domain (with clear recognition of words across accents and speakers) has not yet been investigated, and thus cannot yet be assumed within face recognition, it may be that studies of infants' recognition of a particular person across changes in makeup, dramatic changes in hairstyle, emotional expressions, spatial perspective, or even changes in lighting conditions could demonstrate the development of quite sophisticated perceptual constancies in this domain between 15 and 19 months of age, as well. As noted above, to date, little or no research has been done on this aspect of face recognition.

MULTI-SENSORY INTEGRATION OF FACE AND SPEECH

From these brief reviews of the developmental trajectories of both face and spoken word processing it is clear that there is a generally common pattern, specifically a move from a very broad yet apparently unstructured ability to match and differentiate a range of features of speech and faces, toward more experience-dependent capabilities. This shift encompasses a perceptual narrowing away from non-experienced aspects, as well as a very likely, yet under-explored, elaboration of constancies within commonly experienced aspects, within the first year of life. It is also apparent, however, that the spoken word and face recognition literatures are based overwhelmingly on uni-modal studies—the auditory modality in the case of speech, and the visual modality in the case of faces. Moreover, research on infant perception of words and faces has often focused on the different types of information that can be gained from the stimuli. For example, face recognition research is often focused on the development of the recognition of identity, an indexical and constant aspect of the one face. On the other hand, spoken word/phoneme perception research is often focused not on the indexical aspects of the voice (e.g., who is talking) but on what words that voice is conveying. This makes the similarities striking but it does also make the two literatures difficult to truly compare. Therefore, a key area where the strength of a common developmental mechanism should be in evidence is when the multi-sensory aspects of face and speech perception are considered in concert, particularly in the context of face-to-face interactions. It cannot be ignored that when interacting with a person an infant's experience is typically audio-visual. This is particularly significant in developmental research because there is a considerable amount of redundancy in the audio-visual stimulus that can be important to the development of the uni-modal perceptual capabilities. 
Although many studies present faces (visual) and voices (auditory) in isolation (i.e., uni-modally), an infant is more regularly experiencing live, visible + audible people speaking. This means that they experience the combination of multiple modalities, where a multitude of cues to the same information are present. Studies into infants' ability to capitalize on audio-visual cues suggest that the developmental trajectory of face and spoken language perception are closely intertwined, as they would need to be to take advantage of the powerful multi-modal and amodal cues in the natural environment. When presented with a person speaking there are several aspects of both the visual and the auditory stimulus that are shared across the modalities, in particular, onsets and offsets, the duration, tempo, and rhythm of the two modalities of talking faces (Yehia, Rubin, & Vatikiotis-Bateson, 1998). This information is redundant in that the exact same information can be gained in a fairly equivalent manner across the two senses and it therefore provides an unambiguous cue to aid integration across modalities. Newborns display what appears to be surprising capability in multi-sensory perception, which clearly suggests that the type of information that is redundant in audio-visual speech is a very important cue supporting the development of both face and spoken word perception. For example, newborns at 3 weeks of age have been shown to spontaneously match audio-visual stimuli (a white light and audio white noise) based on the relative intensity of the stimuli. This was measured via the newborn's cardiac response, which differed systematically depending on whether the relative intensity levels were similar or notably different between the audio and the visual stimulus (Lewkowicz & Turkewitz, 1980). 
They have also been shown to match monkey facial gestures with vocalizations, with the evidence strongly suggesting they do this on the basis of the synchrony of onsets and offsets of the audio vocalization and the facial gesture rather than matching the quality of the complex sound to the shape of the mouth (Lewkowicz, Leo, & Simion, 2010). These multi-sensory perceptual abilities can be seen to match an infant's physical acuity capabilities across the senses, being driven by basic amodal (non-specific to a given modality) properties present in natural stimuli, such as the direction and speed and start/stop of a moving, sound-making object. Indeed, it has been proposed that the ability to match audio-visual stimuli at this age is due to the young infant's inability to differentiate reliably among the individual sensory modalities of a multi-modal stimulus, rather than reflecting a particular ability to associate across the two modalities of audio-visual stimuli (the Infant as Synaesthete theory; Maurer, 1993; Maurer & Mondloch, 2004). To the extent that young infants show no evidence of differentiating among modalities, then experience with the amodal properties of stimuli that newborns experience naturally, such as the onsets and offsets of audio-visual speech, should be extremely important in driving the development of the separate modalities' processing capabilities. The theory that infants do not differentiate the senses at birth also highlights the importance of taking a whole brain/integrated perspective, rather than a modular view of development of independent perceptual modalities within the first year of life. It clearly suggests the importance of an integrated and domain- and modality-neutral set of mechanisms in the development of skilled perception in infants. 
The infant synesthesia theory also accords with the suggestion that redundantly specified stimuli, that is, stimuli that specify the same information through multiple modalities, should be strongly attention grabbing/salient for infants (see Bahrick & Lickliter, 2012). To the extent that the redundant aspect of an audio-visual stimulus is dealt with similarly at a neural level (i.e., increased intensity resulting in increased firing), an intrinsic connection between the sensory areas of the brain would ensure that this aspect of the stimulus is activating these two separate sensory areas in concert with each other, making the power/salience of the stimulus greater in effect. It is plausible that from this base the infant is able to begin to experience the features of their multi-modal environment that are not redundantly specified. That is, the patterns that are statistically related across the senses provide a foundation from which to contrastively experience those aspects of a stimulus that are uni-modally specified. Between 2 and 6 months. From about 2 months of age infants are beginning to respond to audio-visual stimuli based on experience gained within their first months. At 2 months of age infants will respond differentially to multi-modal, moving faces depicting different emotions and in particular they also mirror (“imitate”) expressions of joy and sadness presented to them (Haviland & Lelwica, 1987). The multi-modal nature of these stimuli is considered crucial to the discrimination of emotions at this early age (for a review, see Walker-Andrews, 1997). Additionally, 2-month-olds can also match some vowel sounds to the facial motion used to produce the sound (Patterson & Werker, 2003). 
It has been proposed that infants at this young age are matching the sound and facial gesture on the basis of the full spectral and amplitude properties of the stimulus, as even infants as old as 4.5 months do not show evidence of matching vowels on the basis of only simplified temporal or amplitude changes (Kuhl & Meltzoff, 1984; Kuhl, Williams, & Meltzoff, 1991). By 3 months of age it has been found that infants are able to associate new people's faces with their voices, looking longer at novel combinations of recently familiarized faces and voices (Brookes et al., 2001). They have also been shown to search visually for a parent's face when they hear the parent's voice unaccompanied by their face (Spelke & Owsley, 1979), suggesting that an association between the identity of a face and voice is established early. Infants' sensitivity to the correspondences between audio and video (talking face) presentations of specific vowels and consonants is such that 4.5-month-old infants look significantly longer at a face whose articulation matches a synchronously played audio vowel, when two videos of the same face articulating two different vowels are presented synchronously side-by-side (Kuhl & Meltzoff, 1982, 1984). In support of the recognition of vowels across modalities, and as a strong demonstration of the multi-sensory nature of developmental learning, 3- to 4-month-old infants have been shown to imitate facial movements articulating vowels when the vowel sound is paired with the corresponding facial motion but not when the auditory stimulus does not match the visual facial motion (Legerstee, 1990). Infants also recognize articulatory correspondences between seen and heard speech syllables when the two modalities are presented completely separately rather than simultaneously. 
Several recent studies assessed infants' recognition of multi-modal consonant correspondences in a task involving familiarization to one of two contrasting audio-only syllables, followed by a test phase in which infants' looking preferences were assessed to silent videos of a speaker producing syllables that matched versus mismatched the preceding audio consonant. Four-month-old infants fixated longer on the face whose articulations corresponded to the preceding audio stimuli, for both native and non-native consonant contrasts (Best, Kroos, & Irwin, 2010, 2011; Pons, Lewkowicz, Soto-Faraco, & Sebastián-Gallés, 2009; see also Bristow et al., 2009, for ERP evidence of such multi-sensory sensitivity in 2.5-month-olds). At 4 months of age, infants also still match monkey calls to their facial gestures, looking longest at the facial gesture matching the monkey call (Lewkowicz & Ghazanfar, 2006). As yet, however, at 4.5 months infants have not been found to take sex of a face into account when matching vowels across modalities. When two articulating faces are presented side-by-side infants at this age will apparently ignore a mismatch in sex and match according to the corresponding vowel sound (Patterson & Werker, 2003). When presented with audio-visual IDS (infant-directed speech), from 4 months of age infants are able to detect when changes in the lexical-syntactic content, in a speaker's sex, or in synchrony occur in any modality (auditory, visual, or audio-visual). Interestingly, when presented with ADS (adult-directed speech), infants at both 6 and 8 months have not shown evidence of detecting the same changes when they occur in the auditory domain only, despite being able to detect them in the visual-only and audio-visual modality (Lewkowicz, 1996). This not only highlights the multi-sensory capabilities of infants but also the importance of the properties of infant-directed interactions, such as the presence of systematic co-variation in the input. 
This aspect will be discussed below (see the Variability in Infant Directed Spoken Interactions Section). At 5 months of age, infants have been shown to associate the audio with the matching visual component of an audio-visual presentation of a consonant-vowel-consonant-vowel string, preferring to view stimuli that matched in phonemic content and not just synchrony (MacKain, Studdert-Kennedy, Spieker, & Stern, 1983). At this age, infants are also able to learn abstract patterns created by systematically paired looming visual objects and auditory syllables. Infants did not learn the pattern when the syllables were presented without a visual stimulus or when the syllables were paired with objects unsystematically. This demonstrates rule learning, which at this age appears to be driven by the systematic relationship between the combined sensory inputs (Frank, Slemmer, Marcus, & Johnson, 2009). Related to this finding, by 5 months of age infants are able to recognize the correct association between static human versus monkey faces and human speech sounds versus monkey calls, despite not having any particular experience with monkey sounds (Vouloumanos, Druhen, Hauser, & Huizink, 2009). At this age, infants also show evidence of integrating conflicting audio-visual presentations of speech phonemes such that they appear to experience the McGurk effect, in which a synchronously presented visual va/audio ba is heard as a va, just as adults do (Rosenblum, Schmuckler, & Johnson, 1997). Between 6 and 9 months. Up to 6 months of age, infants have been demonstrating a preference for cross-modal stimulation and an increasing ability to carry out complex perceptual tasks across modalities. Yet infants less than 6 months do not show evidence of a decline in ability to carry out tasks with speech categories they have not experienced in their native language environment. 
For example 6-month-olds are able to match non-native consonant contrasts (/b/ and /v/ for Spanish learning infants) across separate auditory and visual presentations (Pons et al., 2009). Yet, as further evidence of beginning to associate aspects of the face with aspects of a voice that are not redundantly specified, at 7 months of age infants can match the emotion of a face and voice across separate presentations of the two modalities (Walker-Andrews, 1986). Moreover, at 8 months infants are able to associate the sex of a face with that of a voice when the voice is articulating the same vowel as the face (Patterson & Werker, 2003). There is also some evidence of perceptual narrowing of cross-modal perceptual abilities in infants older than 6 months of age. While 6-month-olds are able to match monkey facial gestures with their associated call, when tested at 8 months infants no longer show evidence of making this match (Lewkowicz, Sowinski, & Place, 2008). It is hypothesized that at this age infants are no longer relying on basic/amodal aspects of the stimuli to carry out these kinds of tasks, and that without continued experience with cross species perception the task becomes increasingly difficult (Lewkowicz et al., 2008). At this age, multi-modally redundant stimuli still appear to capture attention. Crucially, though, such redundant speech streams also appear to aid subsequent recognition of words that occurred in the stream. At around 7.5 months of age, for example, when infants were familiarized with two simultaneous, competing audio speech streams, they subsequently recognized words from one of the streams if the video of that speaker had been presented synchronously with that stream during familiarization (Hollich, Newman, & Jusczyk, 2005). Between 9 and 12 months there is continued evidence of further perceptual narrowing or experience-based elaboration for multi-modal stimuli. 
Eleven-month-olds recognize a match between separately presented audio-only followed by visual-only presentations of a native consonant contrast, but they do not show evidence of this for certain non-native consonant contrasts such as ejective stops from the Ethiopian language Tigrinya (Best et al., 2010, 2011; Pons et al., 2009), although they succeed with other crucially different non-native consonant contrasts that are categorized by adults as non-speech sounds, that is, click consonants from the Botswanian language !Xòõ (Best et al., 2010, 2011). The results across that set of studies indicate that by 11 months infants have become perceptually attuned to detect just those multi-modal articulatory correspondences that are relevant to native speech contrasts. Between 10 and 12 months, infants also demonstrate the emergence of the ability to match the identity of their native language across modalities for connected speech. However, they do not show evidence of doing this with an unfamiliar language (Lewkowicz & Pons, 2013). In summary, the evidence of a progression toward becoming a skilled perceiver of audio-visual social interactional stimuli suggests that in newborns the multi-modally redundant aspects are extremely important. Given some experience with audio-visual faces, infants begin to extract statistics of the world in relation to commonly co-occurring aspects of the stimuli. As time progresses they become able to recognize and learn associations between increasingly more complex multi-modal patterns. At the same time as these increasingly more sophisticated associations are emerging around 9–12 months, infants' ability to recognize cross-modal matches for rarely encountered classes of audio-visual stimuli that they could/did detect early in the first year, appears to decline just as has been found in the auditory and visual domains separately (see Lewkowicz & Ghazanfar, 2009).

SOME CAVEATS ABOUT COMMON DEVELOPMENTAL PROGRESS ACROSS SENSORY DOMAINS

We propose that the acquisition of perceptual skill may depend on common developmental and representational mechanisms that could be expected to cause developmental milestones to be reached at the same age across domains. Despite the apparently similar developmental trajectory of the perceptual skills we have reviewed here, with perceptual narrowing occurring toward the end of the first year of life in both domains, we do not necessarily expect to find developmental milestones in lock step across domains. Note also that alongside the apparent narrowing by 12 months, we also predict perceptual elaboration by this same age, despite there still being insufficient research on this issue to date. Moreover, the posited common perceptual development mechanism across the two domains will still need to be implemented by, or interact with, the individual neural machinery and the primary input sensory modality(s) of the domain in question. This could lead to different timeframes for the emergence of similar developmental milestones across faces and spoken words. For example, although the retina is not considered to be important in the representation of faces within the visual system, it is nonetheless necessary to visual perception of faces. Its functional maturity will affect the data available to carry out statistical learning about faces. The retina, and the visual system more generally, develop at a much slower rate than the auditory system in the fetus (Gottlieb, 1971; Lickliter, 1993), with the auditory system structurally complete and functional much earlier prenatally. Additionally, the input for spoken language learning is available in the womb, which cannot be said of the visual input necessary for learning to recognize faces (see Lickliter, 1993). 
Therefore, development of auditory skill may appear to precede analogous visual skill simply because the auditory system itself has been receiving relevant data from the final prenatal trimester whereas the visual system processes little data prior to birth. That is, the input data collected may yet be insufficient in one domain (e.g., vision), while in another (e.g., audition) sufficient data has already prompted the next stage in development. As can also be seen after reviewing the developmental literature, the myriad of perceptual decisions that can be made when considering the audio and visual aspects of a face, including a talking face, make finding truly analogous perceptual capabilities across faces and spoken words challenging. Moreover, the multi-modal nature of the natural stimuli and the demonstrated importance of multi-modal stimulation to infants strengthen the case for a common mechanism, yet at the same time complicate our ability to design experiments, and to draw conclusions from prior research within single sensory modalities. Without expecting to be able to draw exact milestone comparisons between domains, comparison of the separate and combined developmental trajectories outlined above is suggestive of a similar development process.

SUMMARY OF DEVELOPMENT

Based on the developmental trajectories of both face and language perception, it appears that at birth infants possess a largely untuned and basic perceptual space capable of differentiating many properties of both faces and speech whether presented in concert (audiovisual talking faces) or separately. Through the first approximately 6 months infants' sensory exploratory behaviors appear to be biased toward collecting data about the statistical structure of the particular stimulus environment within which the infant is immersed. The statistical representation of the basic dimensions of the sensory environment then forms a foundation that is thereafter a base from which perceptual constancies are established and crucial abstractions can be perceived. A skilled perceiver is able to both make judgments about very fine scale differences, and to tackle constancy problems that go beyond the basic surface statistics of the detailed input. The transformation from statistical learner to abstraction learner appears to coincide with the onset of perceptual narrowing across both the face and speech domains. However, we propose that rather than losing perceptual capability, the process of perceptual narrowing represents the transition from basic statistical perceiver to an abstraction learner and could more comprehensively be conceived of as a time of perceptual elaboration. Infants become able to deal with more abstract regularities in their environment, such as the constancy of a word's phonological structure and meaning despite a change of emotional affect or speaker or accent, or the facial identity of a person despite a change in hairstyle or makeup or emotional state. While similarity of the developmental trajectory of face and spoken word perception and the crucial multi-sensory aspects of these kinds of stimuli is striking, these apparent similarities could be driven by dissimilar developmental principles. 
However, other evidence can also be brought to bear to bolster the claim for a common developmental principle behind both.

OTHER EVIDENCE FOR A COMMON PRINCIPLE BEHIND PERCEPTUAL DEVELOPMENT FOR FACES AND SPOKEN WORDS

While we have outlined a similar developmental trajectory as evidence for a common underlying mechanism, the evidence used to support the neuronal recycling hypothesis (Dehaene, 2005) as a common neurodevelopmental process for any “human cultural ability” is also compatible with our theory. The neuronal recycling hypothesis proposes that any apparently unique and recent cultural ability that humans exhibit must reflect an incremental use of flexibility already present in the brains of our nearest ancestors. The development of “human abilities” is therefore ultimately constrained by genetically controlled factors such as receptor density and connectivity patterns. It is not the case, in this scheme, that any and all regularities can be learned. Only those regularities that the brain is set up to be able to learn are possible. Dehaene and Cohen (2011) argue, for example, that the visual word form area is an example of a common visual area with a suitable basic visual purpose (preference for high resolution foveal shapes and for line images) that can be co-opted in the human brain to undertake reading and face recognition. To build on the idea of a common underlying mechanism, we speculate that rather than reinvent totally new systems of learning for each perceptual domain, the same basic neurophysiological processes are recycled throughout the brain and “implemented” when the learner engages with a stimulus that requires the kind of fine grained discrimination and perceptual constancy across variable instantiations that spoken word perception and face recognition demand. This results in a developmental trajectory and a final representational structure that appears similar across different domains. It follows that there would be a range of neuroanatomically distinct regions that appear to operate similarly but manipulate different input. 
This view, that one fundamental mechanism or set of mechanisms is responsible for the development of all skilled perceptual behaviors, is also supported by the evidence put forward for the co-development of lateralization of printed word and face responsive areas of the brain (Dundas, Plaut, & Behrmann, 2013). Lateralization of function has been shown to be flexible and to adjust to provide a systematic, and compartmentalized, relationship between differing functions. For example, developmental emergence of lateralization of the visual areas of the cortex that represent faces and printed words has been shown to be inter-related, such that face recognition becomes more strongly lateralized to the right hemisphere as printed word recognition develops and becomes lateralized to the left hemisphere (Dundas et al., 2013). This seemingly paradoxical finding suggests that when a cortical region normally devoted to one function comes under competition from a later-developing function, the brain's response is to increase modularization of the two functions, in this case through increasing the lateralization of the two functions to the opposite hemispheres. Although this example of mutual competition only encompasses the visual modality, we speculate that allocation of resources across many areas of the brain is a closely interwoven process of ongoing organization even outside of visual perception. In particular, this type of mutual competition driving apparent modularization of the brain would be crucially important considering the naturally multi-sensory nature of both speech perception/spoken word recognition and recognition of individuals by their faces and voices. Indeed we can look at skilled perceptual capabilities through the lens of the information extracted rather than the modality of delivery, and in so doing we begin to identify regions of the brain that are clearly not “modal” (not unimodal). 
For example, Haxby, Hoffman, and Gobbini (2000) outline a proposal for a distributed neural system responsible for face perception. They suggest that there are two “streams” representing the invariant aspects of faces that facilitate identity recognition versus the changeable aspects of faces that facilitate social communication, respectively. These two “separate” aspects of face perception are equally applicable to voice perception. Indeed, the areas of the brain that have been found to be most responsive to the changeable aspects of faces are located in the superior temporal sulcus (Hoffman & Haxby, 2000; Puce, Allison, Bentin, Gore, & McCarthy, 1998). These aspects of face perception share a neighborhood with the area most responsive to spoken language, in the superior temporal gyrus (Calvert et al., 1997). Moreover, integral to the distributed system for face perception is the inclusion by Haxby et al. (2000) of areas of the brain considered to subserve “non-face” cognitive functions, particularly where the same information can be gleaned from either the voice or the face. As an example, lip reading is found to elicit activity in areas associated with processing auditory speech (Calvert et al., 1997). The explicit inclusion of “non-face” brain regions (particularly auditory language related areas in our case) within the proposed face perception system acknowledges the multi-modally integrated nature of the stimuli that carry social information and the ultimate efficiency of harnessing a distributed yet integrated system to process these stimuli. It is important to also highlight that we need not be restricted to consideration of the integration of “receptive” senses. Proprioceptive information, or the awareness of how our own face moves, may also be important in the development of both face recognition (Sugita, 2009) and speech perception capabilities (Ito, Tiede, & Ostry, 2009; Skipper, van Wassenhove, Nusbaum, & Small, 2007). 
To arrive, as adults, at the distributed system subserving skilled perception that is outlined by Haxby et al. (2000), the parsimonious suggestion would be that the same developmental mechanisms are at play across the senses, shaping the sensory brain to ensure independent functioning of each sense, but also integration between senses, according to the statistics of the environmental (and self-generated) input. Following this line of thought, the existence of a basic, generally available learning mechanism would promote the reallocation of an area of the brain to an unusual role in the absence of a stereotypical sensory diet in early development, as happens in cross-modal sensory plasticity with early impairments in hearing or vision (Wong & Bhattacharjee, 2011; see also Shimojo & Shams, 2001). Finally, additional support is also found in evidence that abilities we once thought made us unique among animals are most likely an adaptation and elaboration of a more general organizational principle, used to supreme effect in spoken language and face recognition. In particular, other animals display statistical learning (Hauser, Newport, & Aslin, 2001). There are similarities on many levels between birdsong and human speech (Fehér, Wang, Saar, Mitra, & Tchernichovski, 2009; Gardner, Naef, & Nottebohm, 2005) and we are not alone in our ability to recognize individuals via the face (Martin-Malivel & Okada, 2007; Peirce et al., 2001). By outlining the evidence for a common developmental mechanism (or set of mechanisms) that may subserve skilled perceptual capabilities across the two domains, hints emerge regarding which key aspects of the sensory input promote successful development of skilled perception of faces and spoken words and phonemes. Our proposal is that structured variation in natural face and speech input, in particular, is crucial to the development of skilled perception.

WHY IS VARIATION IMPORTANT?

No matter what it is we are trying to categorize or discriminate, there will always be variability in the input. Even in the ideal circumstance where the environment is controlled such that the physical input is unchanged (as in laboratory studies), every time we encounter an instance of a word or face, internal noise (e.g., in background-level neural firing) will ensure that there is variability in the internal representation. Natural variability in the input may, at first glance, appear to be particularly challenging for infants. However, although some variability will be random and uninformative, in many cases the variability will be quite systematic (even if it is also quite substantial), particularly when it comes to identification of a specific word across speakers with differing accents or identification of a particular person across changes in pose. For this reason, it is important for the perceptual systems to become familiar with the natural systematic versus random variability within and between the categories of stimuli that are important (see Best, in press; Bruce, 1994; Burton, 2013; Hay & Drager, 2007). Both the face and spoken word and speech perception literatures acknowledge the utility of organizing perceptual spaces based on variability. For example, principal component analysis (PCA: Jolliffe, 2005) methods have been useful in modeling human face recognition abilities in adults (Furl, Phillips, & O'Toole, 2002). PCA creates a face space by describing a set of faces as dimensions ordered according to explained variance. Despite the utility of these models in describing some aspects of adult face recognition, a standard assumption is that the infant must learn about random and systematic variability in order to discount it, helping to establish a normalized and “invariant” representation of the particular thing being identified (see also, Bruce, 1994). 
That is, it is tacitly assumed that variability of all kinds is a hindrance to recognition and classification, and that it needs to be filtered out or discarded. That view would suggest that it is optimal to initially present an infant with clean (low-variability) data in order to optimize their ability to establish ideal exemplar traces (or to develop clean category prototypes). In that approach, only then should finer scale variability be introduced to flesh out the representation (Papousek, Papousek, & Bornstein, 1985; Snow & Ferguson, 1977). One important counter example to the “normalization” view is the Perceptual Learning Theory of Eleanor Gibson (1969) (see also J. J. Gibson & E. J. Gibson, 1955) who proposed that rather than establishing prototypes or ideal exemplars of a category, per se, perceptual learning progresses by establishing dimensions of difference. That is, perceptual learning essentially involves coming to recognize the contrastive aspects of stimuli, or elaboration as we have mentioned. E. Gibson (1969) also stressed that learning of differences is boosted when distinctive features are emphasized. Rather than needing a clean, normalized prototype for perceptual learning, the Perceptual Learning Theory view recommends that useful differences should be emphasized. When this is translated to multi-dimensional stimuli like words and faces, in real life the natural input that supports infant learning should be highly variable along a range of dimensions. This will simultaneously enhance differences that should serve perceptual learning across a range of uses. Indeed, data suggest that in development of both face and spoken language perception, the general pattern observed in caregivers' behavior toward young infants is that it presents a wider range of systematic variability along multiple stimulus dimensions than is seen in adult-adult communication, rather than a reduced range of variability.
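As a concrete illustration of the PCA-based face space mentioned above, the following is a minimal sketch, assuming each face is a flattened vector of pixel intensities. It is not the model of Furl et al. (2002); the function names (`build_face_space`, `project`) and the random toy data are hypothetical, chosen only to show how dimensions come out ordered by explained variance.

```python
import numpy as np


def build_face_space(faces):
    """Build a toy PCA face space.

    faces: array of shape (n_faces, n_pixels), one flattened image per row.
    Returns the mean face, the principal axes (one per row, ordered by
    explained variance), and the proportion of variance each axis explains.
    """
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # SVD of the centered data: rows of vt are the principal axes, and
    # singular values are returned in descending order.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (faces.shape[0] - 1)
    explained_ratio = variances / variances.sum()
    return mean_face, vt, explained_ratio


def project(face, mean_face, components, n_dims):
    """Coordinates of one face on the first n_dims axes of the face space."""
    return components[:n_dims] @ (face - mean_face)


# Toy data (an assumption, not real faces): 20 "images" of 64 pixels that
# vary systematically along one hidden direction, plus a little random noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=64)
systematic = rng.normal(size=(20, 1)) * 5.0 * direction
noise = rng.normal(size=(20, 64)) * 0.1
faces = systematic + noise

mean_face, components, explained_ratio = build_face_space(faces)
coords = project(faces[0], mean_face, components, n_dims=3)
```

In this toy example nearly all of the variance falls on the first dimension, because the systematic variation was built to dominate the random noise. A face is then described by its coordinates in the space rather than by a stored exemplar.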

VARIABILITY IN INFANT DIRECTED SPOKEN INTERACTIONS

In the language domain, findings on the audible properties of IDS indicate that a number of crucial acoustic properties of IDS are both exaggerated in range and more variable along a number of important dimensions, relative to ADS (see Best, in press). If variability made initial language learning difficult, then social-cognitive and/or evolutionary principles should push parents to reduce phonetic variability when speaking to their infants (cf. Papousek et al., 1985; Snow & Ferguson, 1977). Instead, caregivers and other people are apparently compelled to expand phonetic variation along multiple dimensions when interacting with babies. IDS, as compared to ADS, displays a larger magnitude and range of excursions in pitch (F0) (e.g., Fernald & Simon, 1984; Kitamura & Burnham, 2003; Kitamura, Thanavisuth, Burnham, & Luksaneeyanawin, 2002), in structured temporal variations (e.g., rhythmic alternation, durational contrasts, speaking rate), and in dynamic adjustments of voice amplitude/intensity, which range from loud to modal to whispered (e.g., Fernald & Mazzie, 1991; Fernald et al., 1989; Grieser & Kuhl, 1988; Kitamura & Lam, 2009; Stern, Spieker, Barnett, & MacKain, 1983). These modifications of pitch, timing and amplitude, in turn, impact on linguistic features such as stress patterning and prosodic modulations that reflect both grammatical structures and pragmatic aspects of discourse (e.g., turn-taking). Variability and range in vowel formant frequencies (F2, F1) is also exaggerated in IDS (Burnham, Kitamura, & Vollmer-Conna, 2002; Kuhl et al., 1997), as is variation in more socially relevant acoustic properties such as emotional affect (Slaney & McRoberts, 2003; Trainor, Austin, & Desjardins, 2000). Moreover, babies prefer and attend more to the increased variation of IDS relative to ADS (Cooper & Aslin, 1990; Fernald, 1985; Fernald & Kuhl, 1987; Kitamura & Lam, 2009). 
Thus, IDS displays an increased range of variation along multiple acoustic dimensions that are relevant to both the linguistic and social aspects of early language acquisition, and that variability appears to capture infants' attention rather than overwhelming them. Recent research has provided evidence consistent with our reasoning that increased acoustic variation in speech helps rather than hinders infants' learning of speech distinctions and spoken words. Infants discriminate vowels better if the stimuli are presented in a variety of pitches than if they are presented in only a single high pitch (Trainor & Desjardins, 2002). Infants of 14 months can learn a novel minimal-pair word distinction (/buk/-/puk/) if the training tokens are produced by multiple speakers, but not if they are produced by just a single speaker (Rost & McMurray, 2009, 2010), and toddlers of 21 months can learn new words presented in IDS but do not show evidence of this with ADS (Ma, Golinkoff, Houston, & Hirsh-Pasek, 2011). Moreover, even infants as young as 7.5 months recognize familiarized words better if they were originally presented in IDS than in ADS (Singh, Nestor, Parikh, & Yull, 2009). They also recognize familiarized words better if they were originally presented in multiple emotional affects (happy, sad, neutral) than in a single affect (Singh, Nestor, et al., 2008). Likewise, 7-month-olds are better able to segment familiarized words from sentences if the words were originally presented in IDS than in ADS (Thiessen, Hill, & Saffran, 2005). Infants whose mothers use wider acoustic variations in their vowels in IDS perform better on discriminating native speech contrasts than do infants of mothers who display less variation in their IDS vowels (Liu, Kuhl, & Tsao, 2003). 
Crucially, infants' speech discrimination performance predicts later word-learning: better discrimination of native speech contrasts and poorer discrimination of non-native speech contrasts at 7 months both predict larger vocabulary size at 14–30 months (Kuhl, Conboy, Padden, Nelson, & Pruitt, 2005; Tsao, Liu, & Kuhl, 2004). In complement to these observations, studies of the acoustic variations in IDS vowels indicate that caregivers provide systematic distributions of those variations that help to distinguish between relevant vowel contrasts in their language (Werker et al., 2007).

VARIABILITY IN INFANT DIRECTED FACIAL INTERACTIONS

But do these beneficial effects of input variation also apply to the visible motions of a speaker's face when she is interacting with infants as compared to adults? Is there also a greater range and more variation in adults' facial motions during infant-directed than adult-directed interactions? If so, does this aid infants' perceptual learning about faces? Although informal observation and general belief would suggest that this is surely the case (e.g., Werker, Pegg, & McLeod, 1994), there has been remarkably little research addressing these questions. What little evidence we could find, nevertheless, does indeed indicate that speech-related facial motion is more extensive (i.e., shows systematically increased variation) in infant-directed interactions than in adult-directed ones. Infants also seem to be attracted to, and/or to benefit from, exposure to variable facial patterns, including dynamic variation (e.g., videos rather than still pictures). Infant-directed facial speech exhibits more extensive lip movements for vowels than does adult-directed facial speech (Green, Nip, Wilson, Mefferd, & Yunusova, 2010; Kim, Davis, & Kitamura, 2012). Moreover, three-dimensional measures of head, eyebrow, jaw, mouth, and lip motion more generally support the same pattern of greater motion in IDS than ADS (Chong, Werker, Russell, & Carroll, 2003; Kim et al., 2012). Relatedly, caregivers' manual gestures and motions involving objects are more extensive and varied in infant-directed than adult-directed verbal interactions about those objects (Brand, Baldwin, & Ashburn, 2009).
And with respect to early development of face perception, research suggests that dynamic stimuli (e.g., videos of facial expressions) benefit infants' ability to discriminate facial emotional expressions (Caron, Caron, & MacLean, 1988; Walker-Andrews, 1986; see also Walker-Andrews, 1997) and may better support their learning and recognition of familiar faces (Cecchini, Baroni, Di Vito, Piccolo, & Lai, 2011; Layton & Rochat, 2007). Infant sensitivity to time-varying audio-visual correspondences in IDS versus ADS would provide further support for our premise that systematic variability is crucial to perceptual learning. Despite stimulus variability being essentially doubled across the two modalities of audio-visual speech, relative to uni-modal audio speech, multi-modal studies have found that infants' preference for IDS over ADS is robust when the speech signal is synchronously presented with the video of the speakers. However, the IDS preference reduces or disappears if the audio is synched with a video of the speaker simply nodding or producing ADS (Werker & McLeod, 1989; see also Lewkowicz, 1996). This pattern holds up regardless of whether the audio stimulus materials are native or non-native speech (Werker et al., 1994). Together these results suggest that the dynamic correspondences between the increased variation in each modality of IDS guide the infant's attention to multi-modally informative aspects of speech. According to the results outlined, increased yet systematic variability is pervasive along a number of stimulus dimensions in infant-directed social interaction. The evidence also suggests that stimulus variation is crucial to the development of speech and word perception skills. Indirect evidence leads us to suggest that the same is true of face recognition.
Given the task of categorizing or distinguishing faces or words across different instances, we suspect that the increased variability of infant directed interaction is central to supporting the rich categorical knowledge the infant must acquire for faces and spoken words. Moreover, we argue that knowledge of informative variability (i.e., systematically organized) is maintained as crucial information about the complex statistical regularities of the natural input in these two domains. That is to say, the key representational strategy for the perceptual spaces the infant is developing is not the extraction of constrained and clean prototypes or averaged representations for facial or language categories. Rather it is the extraction of the acceptable variance within and relationships between categories (Best, in press). Complementary evidence supporting the role of variability in achieving skilled recognition comes from a modeling study looking at development of the other race effect (Balas, 2012). The other race effect is the diminished recognition memory we experience for individual faces from races other than those in our own environment (see above for the development of this perceptual narrowing). The study used a Bayesian estimation of recognition performance after a PCA based model had been trained on faces from one race or two. One key aspect of the model is that it was trained using “difference images” rather than images of individuals. A difference image was constructed by calculating the pixel-by-pixel difference between images of individuals. What this means is the model was trained using variability as the input, not using exemplars. As the number of faces that were used to create the training images increased, the face discrimination performance of the model increased, showing that increased variability can support better discrimination performance. 
The key manipulation was whether the model was trained using difference images created between different races or not. The model trained with difference images created across race boundaries (other race training) developed an ability to discriminate individuals within both the minority and majority race faces. The model trained without cross race difference images produced less discrimination between other race faces. While acknowledging that infants and the model start from different baseline perceptual spaces and with little expectation that we should find this particular model implemented within the brain, the model makes an important contribution: namely that the kind of variability in the training images shapes the performance of the model. Moreover, where increased variability was learned, overall superior performance resulted. We suggest this is also important to consider when looking at the perceptual development of infants in the first year of life.
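The idea of training on "difference images" rather than exemplars can be made concrete with a small sketch. This is only an illustration of the pixel-by-pixel differencing step described above, under the assumption that images are equal-length lists of pixel values; it is not the Bayesian PCA model of Balas (2012), and the function name, the group labels "A" and "B", and the toy pixel values are all hypothetical.

```python
from itertools import combinations


def difference_images(images, groups, allow_cross_group):
    """Return pixel-by-pixel differences for every pair of images.

    images: list of equal-length lists of pixel values, one per individual.
    groups: list of group labels, one per image.
    allow_cross_group: if False, keep only pairs from the same group.
    """
    diffs = []
    for i, j in combinations(range(len(images)), 2):
        if not allow_cross_group and groups[i] != groups[j]:
            continue  # skip pairs that cross the group boundary
        diffs.append([a - b for a, b in zip(images[i], images[j])])
    return diffs


# Toy data (an assumption): four 3-pixel "images", two per group.
images = [
    [0.1, 0.5, 0.9],
    [0.2, 0.4, 0.8],
    [0.7, 0.3, 0.1],
    [0.6, 0.2, 0.2],
]
groups = ["A", "A", "B", "B"]

within_only = difference_images(images, groups, allow_cross_group=False)
with_cross = difference_images(images, groups, allow_cross_group=True)
```

With four images, the within-group training set contains two difference images and the set that also includes cross-group pairs contains six. The second set therefore exposes a learner to kinds of variability that are absent from the first, which is the manipulation the modeling study turns on.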

BILINGUAL DEVELOPMENT: IS INCREASED VARIABILITY HARMFUL?

A key question, then, is whether there can be too much variability. One way to address this is through research on a population that experiences relatively more variability, such as infants who are born into a bilingual environment. Bilingual-learning infants receive the added variability of regularly encountering two (or more) languages in their input. Thus, it is important to ask whether bilingual-to-be infants' speech perception and spoken word recognition skills are impeded, unaffected, or enhanced (the latter as our proposition would predict) by this increased yet linguistically systematic variation. A recent burgeoning of research on bilingual versus monolingual infants makes it possible to begin answering that question. Depending on the task used, on the abilities tested, and possibly on aspects of their language environments, all three patterns of difference between bilingual and monolingual infants' performance have been reported. Nonetheless, consideration of the full array of findings suggests that bilingual infants are well able to accommodate the extra cross-language variability in their speech input, and even show benefits in speech perception and in certain non-linguistic perceptual-cognitive skills, relative to their monolingual peers. For example, newborns of bilingual mothers show listening preferences for both of their mothers' languages as compared to non-native languages, and can also discriminate between the two maternal languages (Byers-Heinlein et al., 2010) even if they are from the same rhythmic class, unlike monolingual-to-be newborns. As for attunement to the phoneme contrasts of their two native languages, early findings indicated a modest decline around 8 months in bilingual infants' audio-only discrimination of speech contrasts in both their native languages, in contrast to the good native speech contrast discrimination observed in their monolingual peers of each language. However, this decline was temporary.
The bilinguals regained good discrimination of contrasts in both of their languages within the following month or two, by 10–14 months of age (e.g., Bosch & Sebastián-Gallés, 2003). Importantly, though, more recent evidence from a range of studies indicates that if more sensitive testing techniques are used, bilingual infants do discriminate the contrasts of both languages across the full age range including the 8- to 12-month period (Albareda-Castellot, Pons, & Sebastián-Gallés, 2011; Burns, Yoshida, Hill, & Werker, 2007), and that they outperform monolingual infants in discriminating consonant differences between their languages (Sundara, Polka, & Molnar, 2008). They also outperform monolingual peers at 8 months in discriminating between their two languages when presented with only silent-video talking faces (Weikum et al., 2007). Even more important for our hypothesis about the role of systematic variation in perceptual attunement, 8-month-old bilingual infants may show additional benefits over monolinguals even for discrimination of unfamiliar non-native speech contrasts, whether the speech is presented in audio (Petitto et al., 2012) or visual-only form (Sebastián-Gallés, Albareda, Weikum, & Werker, 2012). And beyond the influence of bilingual exposure on speech and word perception, several recent studies reveal cognitive benefits as well. Bilingual 7-month-olds outperform their monolingual peers in non-language tasks that involve learning multiple rules (Frank et al., 2009; Kovács & Mehler, 2009; Kovács, Mehler, & Carey, 2009) or require delayed recall of a series of actions when they must generalize across multiple dimensions of stimulus variation (Brito & Barr, 2014). Altogether, the findings on bilingual-experienced infants support the idea that they not only can and do sort out, but in fact take advantage of, the multi-lingual (and multi-dimensional) variation they are exposed to.
They apply that knowledge to recognition of speech contrasts both in their two native languages and in unfamiliar languages. Moreover, their ability to detect systematicity in variation along multiple stimulus dimensions extends even beyond spoken language, supporting their ability to categorize and remember multiple dimensions of variation across non-linguistic events and objects.

HOW IS VARIABILITY USED IN DEVELOPMENT?

As outlined above, it appears that enhanced yet systematic variation is important to the development of the perceptual spaces representing faces and spoken words. The suggestion from the above reviews of the developmental trajectory of face and spoken word and phoneme perception is that the first stage of development involves establishing the basic/foundational dimensions of the sensory space specific to an individual infant's sensory environment. This space is proposed to be created via the interaction of statistical learning with young infants' early perceptual biases to attend to certain types of stimuli. Perceptual biases would act to constrain the general focus of statistical learning to stimuli most relevant for development of a robust perceptual space. We have discussed some “static” face- and language-specific perceptual biases in infants during the first few months of life (e.g., visual objects that have a face-like energy profile or a top-heavy arrangement of elements) and there are, additionally, indications in both the speech perception/word recognition and face perception literature that statistical properties of the input are important in the subsequent developmental organization of the child's internal word space and face space (O'Toole, Abdi, Deffenbacher, & Valentin, 1993; Saffran, Aslin, & Newport, 1996; see also, Gervain & Mehler, 2010; O'Toole, 2011). Beyond this stage it is suggested that there is a progression from statistical learning to a more domain-specific, referent-based, abstract and socially or culturally influenced learning in older infants, as discussed within the language acquisition literature (Gervain & Mehler, 2010). The key question, then, is: Which aspects of variability are most important to supporting optimal development of an infant's perceptual space?
It would have to be admitted that what is important likely changes with developmental progression as well as situationally, depending on the motivations of the infant and perhaps even the complexity of the environment itself. To answer this question, though, we suggest that the aspects of the sensory environment that are most beneficial to infants can be revealed by observing which aspects they preferentially interact with and which aspects they actively disengage from. Just as we can observe that infants will engage preferentially with a static object that contains properties that make it more face like, we can observe the dynamics of a certain set of stimuli that will preferentially engage an infant's attention. There is a hypothesis, endearingly termed “the Goldilocks effect” (Kidd, Piantadosi, & Aslin, 2012), that infants will engage most with stimuli that are optimal in complexity for their developmental stage, with stimuli outside of the optimal complexity range either failing to attract attention or simply eluding the cognitive or sensory capacity of an infant (what we might call a “dynamic bias”). For example, when presented with checkerboard stimuli with different numbers of checks, infants at 3, 8, and 14 weeks will, respectively, look preferentially at increasingly complex checkerboards (Brennan, Ames, & Moore, 1966). Importantly, with these stimuli the preference is thought to not be linked to acuity or accommodation ability (i.e., the infant is able to resolve the more complex but not preferred stimuli). Rather, their preference is proposed to be due to the level of complexity in the visual stimulus itself, likely related to a more nuanced set of developmental factors within the perceptual system itself. 
Newborns have also been shown to recognize sequences of pairs of objects when the sequence consisted of only two pairs presented in a random order, but they were not found to recognize the same pairs when the sequences were increased to three pairs of objects (Bulf et al., 2011). In this study the sequences consisted of pairs of simple shapes, such as triangles and squares, which were always presented together in the same order; however, the pairs themselves could be presented in any order relative to each other. Learning was tested using a habituation procedure, which revealed that infants look preferentially at a post-habituation violation of the statistics of the sequence of the objects. This shows that newborns only several days old can learn information about the probabilities of occurrence of related visual stimuli. However, this learning was shown to be confined to two pairs of objects without being apparent when three pairs were presented. Learning, and therefore meaningful engagement with a complex statistical sequence, appears to be constrained by cognitive capacity, suggesting a natural engagement with an optimal level of variability for a newborn. At 5 months of age, infants have been shown to look longer at a random sequence of looming objects than at a sequence composed of repeating pairs of stimuli (a sequence newborns can already learn). They were also found to disengage attention to sequences structured into pairs or triplets mainly at points in the sequence where the transition between shapes is locally repetitive rather than according to the global sequence pattern (Addyman & Mareschal, 2013). This suggests that at 5 months of age infants' looking preferences remain attuned to a certain level of complexity in the stimulation, given that they disengage from a locally repetitive sequence. 
As the local rather than global repetition appears to govern disengagement, however, it is likely that an infant's mode of engagement with a stimulus containing many types of complexity is also constrained by their current cognitive capacity. Kidd et al. (2012) measured 8-month-old infants' likelihood of looking away from a visual scene composed of objects appearing from behind occluders with varying probabilities. They found that disengagement was related to the level of information contained in the scene. Infants engaged with an intermediate level of complexity; too much or too little and the infant was shown to disengage from the scene. Despite showing clear signs of developing an experience-constrained perceptual space, even at 11 months of age infants have been shown to combine predictive cues of different strengths in a straightforward fashion to learn a regular pattern. This is unlike the adult participants in this study, who combined these cues in a way that is less than optimal, apparently favoring an overly complex interpretation of the pattern of the cues (Yurovsky, Boyer, Smith, & Yu, 2013). The pattern suggests that even at 11 months infants demonstrate different cognitive capabilities or strategies than adults. This will have a significant impact on the shaping of their developing perceptual spaces, bestowing an advantage in spite of environmental stimulation that appears suboptimal for an adult. Despite the influence of the Goldilocks effect, if we were to construct an optimal artificial environment for a developing infant, to truly establish what is important to provide at each age, then different aspects of the possible stimuli should be pitted against each other, more so than investigating what an infant is able to perceive in an isolated cue situation.
Using an artificial language created to present infants with a set of useable cues, Saffran and Thiessen (2003) found that at 7.5–8 months of age infants follow statistical information in the form of transitional probabilities of syllables within and between words, rather than relying on stress cues that indicate the start of a word. In contrast, at ∼9 months of age infants were shown to use stress cues over statistical information (Johnson & Jusczyk, 2001). This would imply that if we were to construct an optimal artificial language to aid perception, for infants at 7 months it should focus on conveying the statistical probabilities of the language they are learning, perhaps with exaggeration of permissible transitional probabilities conveyed via repetitions of a range of common, key transitions. By 9 months, however, it would seem that provision of an enhanced range of the stress patterns of words would be more important. In the face recognition literature, although it is difficult to enhance variability in the same way when it comes to identifying individuals, while young infants appear to be sensitive mainly to the individual features of a face they should be best served by seeing many individuals with many different features. From ∼8 months of age, when they become more sensitive to the configuration of a face (Ferguson et al., 2009) they should then be exposed to many faces with different configurations. While this may seem impossible to manipulate naturally (by one person) in the same way as spoken words can be manipulated, it should be acknowledged that at a basic level, a dynamic face presents an infant with a set of continuously changing (yet constrained) features and configurations as the face moves and creates new facial expressions. 
Beyond this, however, it may be that the exposure infants of this age have to multiple faces of differing shapes, and to multiple speakers with differing voice qualities or accents, is optimal for learning the critical properties of face configurations and words. We should point out that this is not to say we should create an artificial environment that would be optimal for infants. As we have seen, infant-directed interactions already naturally provide infants with an environment that appears to be best suited to their perceptual development. Moreover, infants themselves appear to tailor their interactions, engaging primarily with aspects of the world that appear to suit their stage of development best. Further studies investigating perceptual constancies pitted directly against discrimination ability within the same sets of stimuli should elaborate the cues that are most important at each age across the many aspects of face and spoken language/word perception.

NEW DIRECTIONS FOR EXPERIMENTATION

The proposals here suggest several new directions in experimentation. In particular, several areas have been highlighted where questions that are common and central in one domain can be translated meaningfully into the other. The new directions can be grouped into studies of the importance of variability in faces and speech during infant-directed interaction, studies into perceptual constancy, and studies into perception among multiple dimensions of stimulus variation. The importance of variation in stimuli for infant learning has been highlighted in studies addressing how infants acquire speech perception and spoken word recognition skills, particularly with respect to the increased variability in infant-directed speech input. The further development of cross-language speech perception in bilingual-learning infants is a particularly useful context in which to investigate the principles we have been discussing. We propose that structured variability is crucial to development of skilled perceptual capabilities. This implies that across the environmental input there must be statistical regularities that structure the increased variability. Within the population of bilingual-learning infants there likely exists great diversity in the structure of infants' interactions with both languages. Investigating bilingual infants' interactions in their two languages could inform us further about which aspects of the structure of variability are important. For example, infants learning two quite similar languages may develop some capabilities at a younger age if the languages are separated by some other context, such as the identity of the speaker (e.g., mum speaks French and dad speaks English). Similarly, to the extent that infants are regularly interacting with adults who mix languages within utterances, we may discover the important limits of an infant's ability to thrive on variability. 
Conversely, there is also a natural circumstance whereby an infant receives reduced variation in IDS. It has been shown that mothers with depression tend to exhibit flat vocal affect (Bettes, 1988) and their IDS contains significantly less modulation in fundamental frequency (Kaplan, Bachorowski, Smoski, & Zinser, 2001). A careful study of how infants of mothers with depression learn speech distinctions and spoken words should also highlight the crucial aspects of increased variability in IDS. As yet there have been few studies addressing the importance of increased variability in face recognition. According to the hypothesis that face and language recognition are mediated by the same underlying principle, we would predict that, just as in IDS, caregivers' facial motion is more exaggerated when interacting with infants than when interacting with adults in the context of all interactions, not just those outlined above (Brand et al., 2009; Chong et al., 2003; Green et al., 2010; Kim et al., 2012). This exaggeration should be found to be preferred by infants, and also to be crucial to an infant's development of face perception capabilities, even to recognition of individual identities. For example, further studies pitting recognition of individual faces displaying adult-directed facial expressions (ADFE) against recognition of faces displaying infant-directed facial expressions (IDFE) should demonstrate that IDFE support a range of judgments over and above ADFE. IDFE may also provide advantages in discrimination of stimuli that are currently thought to be undifferentiated, such as attention to and memory for internal features of a face earlier in infancy. Additionally, we have outlined evidence that suggests that infants actively engage with stimuli of just the right complexity for their developmental stage (Addyman & Mareschal, 2013; Bulf et al., 2011; Kidd et al., 2012). 
As infant and caregiver interaction is a two-way process (see Ainsworth, 1979; Bowlby, 1958), this might suggest that infants are able to influence a parent's infant-directed interaction to subtly adjust which aspects of the social stimulus (face or voice) are being enhanced. We would predict that careful measurement of the aspects of caregivers' facial and vocal interactions with infants at each age should reveal changes that will reflect the dimensions of the environmental space that infants are most sensitive to at each stage of perceptual development. To fully understand the transition from purely statistical learning to more nuanced elaboration of the perceptual space capable of supporting complex perceptual constancy judgments, investigation is needed into perceptual constancy in face recognition in particular, but also in spoken phoneme and word perception. We would predict that more complex constancies unfold as an infant moves from a statistical learning regimen to the more elaborated and domain-specific referents regimen. The constancy judgments infants are able to achieve should therefore be influenced by experience. As such, we would predict that there may also be a perceptual narrowing found for constancy judgments. Infants past 6 months of age should display perceptual constancy only within experienced classes of stimuli. For example, recognition of individuals across viewpoints should decline for faces within categories that are not experienced, such as other-race faces. Similarly, recognition of vowels across speakers should also decline when the vowel is from a non-native language and does not occur in the native language. By building variability into the stimulus design of experiments we can understand the developmental trajectory of not only discrimination but also perceptual constancy. 
To understand which particular dimensions of the perceptual space are being formed at each developmental stage we propose that it is important to construct stimuli that are able to pit different aspects of the environmental space against each other directly. For example, by constructing an artificial diet of faces that vary to differing extents along dimensions suggested by an image description system such as PCA, it may be possible to elucidate the aspects of faces that infants are most sensitive to at each stage of perceptual development. Faces are particularly challenging to study, as the units of a face that are nameable (e.g., eyes and nose) do not necessarily correspond to the perceptual units important for perception of face identity. Therefore, it will be extremely important to create stimuli using statistical and modeling techniques that move away from specifying face manipulations in terms of nameable features.
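As a rough illustration of what a face "diet" varying along PCA-derived dimensions could look like computationally, the sketch below runs PCA on stand-in data and builds stimuli displaced from the mean along chosen components. Everything here is an assumption made for illustration: the random matrix stands in for real face measurements, and the names `make_stimulus`, `project`, and `diet` are hypothetical. It is not a description of any existing stimulus set or of a procedure used in the studies cited.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 200 "faces", each described by 50 correlated measurements.
# Real work would use aligned face images or shape coordinates instead.
faces = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# PCA via SVD: the rows of vt are orthonormal principal axes,
# ordered from most to least variance explained.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
component_sd = singular_values / np.sqrt(len(faces) - 1)

def make_stimulus(component, n_sd):
    """Shift the mean face n_sd standard deviations along one principal axis."""
    return mean_face + n_sd * component_sd[component] * vt[component]

def project(face):
    """Return the coordinates of a face in the principal-component space."""
    return (face - mean_face) @ vt.T

# A hypothetical diet: wide variation along component 0, narrow along component 1.
diet = [make_stimulus(0, s) for s in np.linspace(-2, 2, 5)]
diet += [make_stimulus(1, s) for s in np.linspace(-0.5, 0.5, 5)]
```

Because each stimulus is defined by its position along statistical axes rather than by named parts such as "eyes" or "nose", the manipulation does not depend on a language-based description of the face.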

CONCLUSION

In summary, we have proposed that the development of perceptual skill in face and spoken word recognition follows a similar trajectory due to a set of common, centrally important learning mechanisms. These mechanisms capitalize on the structured variability in infant-directed communication, which supports development of the perceptual spaces representing the organization of the sensory information relating to all aspects of faces and spoken words. The perceptual space is initially elaborated according to physical statistical properties in the infant's environment that lead to the apparent narrowing in discrimination of non-experienced stimuli during the first year of life for both faces and spoken words. Thereafter, a combined environmentally, socially and culturally driven strategy supports the development of elaborated representations that can produce more sophisticated perceptual constancies. We anticipate that these domain-general developmental mechanisms and the subsequent description of the perceptual space would apply to any stimulus for which we become such exquisitely tuned, skilled perceivers so early in life. However, the perception of spoken words and faces occupies such an important role in infant development, as well as in mature perceptual functioning, that these may be the domains in which this is seen most clearly.

137 in total

1. Six-month-old infants' preference for lexical words.

Authors: R Shi; J F Werker
Journal: Psychol Sci Date: 2001-01

2. Distinct representations of eye gaze and identity in the distributed human neural system for face perception.

Authors: E A Hoffman; J V Haxby
Journal: Nat Neurosci Date: 2000-01 Impact factor: 24.884

3. A cross-language comparison of /d/-/th/ perception: evidence for a new developmental pattern.

Authors: L Polka; C Colantonio; M Sundara
Journal: J Acoust Soc Am Date: 2001-05 Impact factor: 1.840

4. Modularity and cognition.

Authors:
Journal: Trends Cogn Sci Date: 1999-03 Impact factor: 20.229

5. Is infant-directed speech prosody a result of the vocal expression of emotion?

Authors: L J Trainor; C M Austin; R N Desjardins
Journal: Psychol Sci Date: 2000-05

6. The distributed human neural system for face perception.

Authors:
Journal: Trends Cogn Sci Date: 2000-06 Impact factor: 20.229

7. Spoken word recognition and lexical representation in very young children.

Authors: D Swingley; R N Aslin
Journal: Cognition Date: 2000-08-14

8. Infant preferences for attractive faces: a cognitive explanation.

Authors: A J Rubenstein; L Kalakanis; J H Langlois
Journal: Dev Psychol Date: 1999-05

9. Human face recognition in sheep: lack of configurational coding and right hemisphere advantage.

Authors: J W Peirce; A E Leigh; A P C daCosta; K M Kendrick
Journal: Behav Processes Date: 2001-06-13 Impact factor: 1.777

10. Segmentation of the speech stream in a non-human primate: statistical learning in cotton-top tamarins.

Authors: M D Hauser; E L Newport; R N Aslin
Journal: Cognition Date: 2001-03

8 in total

1. Enhanced attention to speaking faces versus other event types emerges gradually across infancy.

Authors: Lorraine E Bahrick; James Torrence Todd; Irina Castellanos; Barbara M Sorondo
Journal: Dev Psychol Date: 2016-11

Review 2. Sources of Confusion in Infant Audiovisual Speech Perception Research.

Authors: Kathleen E Shaw; Heather Bortfeld
Journal: Front Psychol Date: 2015-12-15

3. Monolingual and Bilingual Infants' Ability to Use Non-native Tone for Word Learning Deteriorates by the Second Year After Birth.

Authors: Liquan Liu; René Kager
Journal: Front Psychol Date: 2018-03-15

4. Modulation of Theta Phase Synchrony during Syllable Processing as a Function of Interactive Acoustic Experience in Infancy.

Authors: Silvia Ortiz-Mantilla; Cynthia P Roesler; Teresa Realpe-Bonilla; April A Benasich
Journal: Cereb Cortex Date: 2022-02-19 Impact factor: 5.357

5. What does a critical period for second language acquisition mean?: Reflections on Hartshorne et al. (2018).

Authors: Arturo E Hernandez; Jean P Bodet; Kevin Gehm; Shutian Shen
Journal: Cognition Date: 2020-10-16

6. Spoken Word Recognition Enhancement Due to Preceding Synchronized Beats Compared to Unsynchronized or Unrhythmic Beats.

Authors: Christos Sidiras; Vasiliki Iliadou; Ioannis Nimatoudis; Tobias Reichenbach; Doris-Eva Bamiou
Journal: Front Neurosci Date: 2017-07-18 Impact factor: 4.677

7. Beauty-related perceptual bias: Who captures the mind of the beholder?

Authors: Yan Zhang; Yu Xiang; Ying Guo; Lili Zhang
Journal: Brain Behav Date: 2018-03-26 Impact factor: 2.708

8. Auditory Emotion Word Primes Influence Emotional Face Categorization in Children and Adults, but Not Vice Versa.

Authors: Michael Vesker; Daniela Bahn; Christina Kauschke; Monika Tschense; Franziska Degé; Gudrun Schwarzer
Journal: Front Psychol Date: 2018-05-01
