
Temporal signatures of processing voiceness and emotion in sound.

Annett Schirmer1,2, Thomas C Gunter1.   

Abstract

This study explored the temporal course of vocal and emotional sound processing. Participants detected rare repetitions in a stimulus stream comprising neutral and surprised non-verbal exclamations and spectrally rotated control sounds. Spectral rotation preserved some acoustic and emotional properties of the vocal originals. Event-related potentials elicited to unrepeated sounds revealed effects of voiceness and emotion. Relative to non-vocal sounds, vocal sounds elicited a larger centro-parietally distributed N1. This effect was followed by greater positivity to vocal relative to non-vocal sounds beginning with the P2 and extending throughout the recording epoch (N4, late positive potential) with larger amplitudes in female than in male listeners. Emotion effects overlapped with the voiceness effects but were smaller and differed topographically. Voiceness and emotion interacted only for the late positive potential, which was greater for vocal-emotional as compared with all other sounds. Taken together, these results point to a multi-stage process in which voiceness and emotionality are represented independently before being integrated in a manner that biases responses to stimuli with socio-emotional relevance.
© The Author (2017). Published by Oxford University Press.

Keywords:  auditory cortex; gender; implicit perception; prosody; sex differences

Year:  2017        PMID: 28338796      PMCID: PMC5472162          DOI: 10.1093/scan/nsx020

Source DB:  PubMed          Journal:  Soc Cogn Affect Neurosci        ISSN: 1749-5016            Impact factor:   3.436


Introduction

Of the multitude of sounds reaching our ears, the voices of other humans—especially if they are emotional—stand out. Attempting to explain this phenomenon, neuroscience has compared the processing of vocal with non-vocal and that of emotional with neutral stimuli. Resulting insights point to specialized brain mechanisms and networks underpinning representations of voiceness and emotion, respectively (for recent reviews see Schirmer ; Schirmer and Adolphs, in press). Yet, whether and how these representations are integrated is still unexplored. Here we sought to address this issue by manipulating both voiceness and emotion in the context of an event-related potential (ERP) study.

Evidence for the special processing of voiceness comes from functional magnetic resonance imaging (fMRI) and ERPs. fMRI research has helped characterize the brain’s auditory system and identified regions that are more excited by human vocalizations as compared with non-human vocalizations (Fecteau ), inanimate nature sounds, or man-made environmental noises (Belin ). These regions are located in the middle aspect of the superior temporal gyrus (STG) and sulcus (STS) and are referred to as temporal voice areas (Yovel and Belin, 2013). ERP evidence has come from the passive oddball paradigm in which participants perform a foreground activity on the backdrop of a task-irrelevant sound sequence comprising frequent standards and rare deviants. Relative to standards, deviants elicit a mismatch negativity around 200 ms following sound onset (Näätänen ) and this negativity is larger when deviants are vocal as compared with synthesized (Schirmer ). Subsequent ERP studies presenting vocal and non-vocal sounds equiprobably identified temporally overlapping effects. They found that voiceness enhances a positive deflection around 200 ms referred to as the fronto-temporal positivity to voice (FTPV) (Charest ; Bruneau ) with potential sources along the STG/STS in temporal voice areas (Capilla ). 
Yet unlike fMRI evidence, ERP evidence has delineated early markers of voiceness preference inconsistently (Levy ; De Lucia ; Rigoulot ). For example, a recent study comparing human voices against a range of other sounds including animal vocalizations, music and sounds from man-made objects failed to observe an overall human effect. Instead, there was an early differentiation between living and non-living sources (70 and 119 ms), followed by an enhancement for human voices relative to animal sounds (169 and 219 ms), and for music relative to other man-made objects (251 and 357 ms) (De Lucia ). As such, the authors questioned the special status of the human voice.

Apart from signaling the presence of another person, social stimuli inform about that person’s identity (e.g. age, sex) and mental state. Of particular interest here is the emotion content of social stimuli and how that content is processed. Both fMRI and ERP research suggest that emotionality enhances auditory representations in general and vocal representations in particular. Looking at sounds in general, there is evidence that auditory cortex, amygdala and medial prefrontal cortex are more active for positive and negative as compared with neutral conditions (Viinikainen ; Escoffier ). In the ERP, a late positive potential (LPP) is modulated by emotion. Task-relevant deviants presented in an active oddball paradigm elicit a larger LPP than standards and this effect is greater when deviants differ from standards in affect as compared with intensity (Thierry and Roberts, 2007). Looking at voices more specifically, emotionality excites the temporal voice areas especially in the right hemisphere as well as the left inferior frontal gyrus (Kotz ; Warren ; Leitman ; Frühholz ). Perhaps surprisingly, the amygdala is rarely implicated (Ethofer ; Brück ; Mothes-Lasch ) unless more lenient statistical thresholds are used (Beaucousin ; Fecteau ). 
In the ERP, the LPP shows larger amplitudes for emotional as compared with neutral expressions (Pell ; Pinheiro ). Additionally, there are earlier emotion effects temporally overlapping with the FTPV. In a passive oddball paradigm, emotional relative to neutral voices enhance the mismatch negativity around 200 ms following stimulus onset (Schirmer , 2016a). For equiprobable stimulation, a P200 modulation emerges fairly consistently and may be related to the FTPV (see “Discussion” section). The P200, a centrally distributed component, differentiates between different kinds of emotional expression or is larger when voices are emotional as compared with neutral (Paulmann and Kotz, 2008; Sauter and Eimer, 2010; Schirmer ).

Although the brain structures and mechanisms supporting social perception seem fairly universal, they differ somewhat between the sexes (Proverbio ; Schirmer ; Proverbio and Galli, 2016). For example, sex differences have been reported for the temporal voice areas, which are larger and more voice-sensitive in women as compared with men (Ahrens ). Women are also more sensitive than men to acoustic change in vocal sounds (Schirmer ) as well as to vocal emotions. For example, when emotions are task-irrelevant, the P200 amplitude difference between emotional and neutral voices is greater in women than in men (Schirmer ).

In sum, a substantial number of both fMRI and ERP studies have tackled the perception of voiceness and emotionality suggesting that more processing resources are dedicated towards vocal as compared with non-vocal and emotional as compared with neutral sounds. Additionally, compared with men, women seem more sensitive to voices and the emotions encoded in them. Notably, however, past research pursued voiceness and emotionality effects separately. To the best of our knowledge, both features have been manipulated within the same study only in the visual modality. 
In this study, participants saw positive and negative scenes that did or did not contain people (Proverbio ). Scenes with but not without people elicited a greater positivity around 100 ms when scenes were positive as compared with negative. Humanness and emotionality also interacted for a negativity peaking around 200 ms and the following LPP and these latter interactions were more prominent in female than in male participants. Thus, it seems that humanness may be processed in combination with rather than separate from emotionality and that their later and perhaps more top-down integration is stronger in women than in men.

In the present study, we sought to explore the temporal course of voiceness effects in the ERP and to determine their relation with emotionality. We presented neutral and surprised exclamations and their spectrally rotated counterparts in random order and with equal probability. The rotated stimuli, although distinctly non-human, were acoustically similar to their originals and preserved some emotionality (Scott ; Warren ; Obleser ; Sauter and Eimer, 2010). Participants detected rare sound repetitions. Our expectation was that, in line with some reports, voiceness and emotion enhance a positive component peaking around 200 ms following stimulus onset. Additionally, based on evidence from the visual modality (Proverbio ), we speculated that voiceness and emotionality interact in this component and, perhaps, subsequently. Last, we anticipated that, compared with men, women are more sensitive to voiceness and emotion.

Methods

Participants

Thirty-five participants were recruited for the experiment. Three were excluded from data analysis because of excessive movement artifacts in the EEG. Half of the remaining 32 participants were female, with an average age of 24.8 years (s.d. = 2.9). Male participants had an average age of 25.2 years (s.d. = 2.8). All participants were right-handed and reported no hearing or neurological impairments.

Stimulus materials

Twenty-seven individuals expressed ‘Ah’ with surprise and neutrality. Non-vocal controls were created by spectral rotation (http://www.phon.ucl.ac.uk/resource/software-other.php) so as to retain basic similarity with the vocal originals (Obleser ; Warren ). Nevertheless, sound acoustics necessarily changed as described in the Supplementary Materials. Although sounding distinctly non-human, rotated surprised sounds were perceived as more emotional than rotated neutral sounds (see Supplementary Materials).
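Spectral rotation mirrors a sound’s spectrum about a fixed center frequency, so low-frequency energy moves high and vice versa while the temporal envelope is preserved. The stimuli here were made with the UCL tool linked above; the following numpy sketch (with an assumed 2 kHz rotation point and no pre-filtering) only illustrates the principle.

```python
import numpy as np

def spectrally_rotate(signal, sr, center_hz=2000.0):
    """Mirror the spectrum about center_hz by reversing FFT bins
    below 2 * center_hz. Illustrative only: proper implementations
    band-limit the signal first to avoid edge artifacts."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    band = freqs <= 2 * center_hz      # bins taking part in the mirror
    rotated = spec.copy()
    rotated[band] = spec[band][::-1]   # frequency f -> 2*center_hz - f
    return np.fft.irfft(rotated, n=n)
```

A 500 Hz tone processed this way re-emerges with its energy near 3500 Hz (2000 + (2000 − 500)), which is why rotated sounds keep pitch and amplitude dynamics yet sound distinctly non-vocal.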

Procedure

Participants were tested individually. To prepare them for the EEG recording, a 64-channel cap with empty electrode holders was placed on their head. The electrode holders, which were organized according to the modified 10–20 system, were filled with electrolyte gel and electrodes were placed into them. Individual electrodes were attached above and below the right eye and at the outer canthus of each eye to measure eye movements. One electrode was attached to the nose for data referencing so as to enable the exploration of dipoles situated in auditory cortex (Näätänen ). The data were recorded at 500 Hz with a BrainAmp EEG system. Only an anti-aliasing filter was applied during data acquisition (i.e. sinc filter with a half-power cutoff at half the sampling rate). Following the EEG set-up, participants were seated in front of a computer screen that was framed by two speakers. On-screen instructions informed participants that they would hear a sequence of sounds and that their task was to press a button using their right hand any time a sound was immediately repeated. The task comprised three blocks in which sounds (i.e. 27 neutral/vocal, 27 surprised/vocal, 27 neutral/non-vocal and 27 surprised/non-vocal) were played in random order. Thus, stimuli were played thrice across separate blocks. The two stimuli forming a given vocal/non-vocal sound pair occurred in separate block halves so as to minimize the emergence of potential acoustic associations. In addition to the trials described thus far, 24 sounds were randomly selected for repetition within each block so as to engage participants with the auditory material without highlighting the nature of the sounds and without necessitating a confounding motor response on unrepeated, experimental trials. Each trial started with a white fixation cross centered on a black background. After 500 ms, a sound played (average duration = 506 ms; s.d. = 25 ms) and the fixation cross remained for 1000 ms. An empty inter-trial interval had a random duration ranging between 2000 and 4000 ms.
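As a sketch of the sequencing constraints above, the snippet below assembles one block in which the two members of each vocal/rotated pair land in separate block halves and a random subset of sounds is doubled to serve as repetition targets. The function name, pair representation, and seeding are illustrative assumptions, not the authors’ actual presentation script.

```python
import random

def build_block(vocal_pairs, n_repeats=24, seed=0):
    """Assemble one block of trials.

    vocal_pairs: list of (vocal_sound, rotated_sound) identifiers.
    Each pair is split across block halves so that acoustically
    related stimuli never occur close together; n_repeats sounds
    are then doubled to create immediate-repetition targets."""
    rng = random.Random(seed)
    half1, half2 = [], []
    for vocal, rotated in vocal_pairs:
        first, second = (vocal, rotated) if rng.random() < 0.5 else (rotated, vocal)
        half1.append(first)
        half2.append(second)
    rng.shuffle(half1)
    rng.shuffle(half2)
    trials = half1 + half2
    # double a random subset of sounds as immediate-repetition targets
    for sound in rng.sample(trials, n_repeats):
        i = trials.index(sound)
        trials.insert(i + 1, sound)
    return trials
```

With the 54 vocal/rotated pairs used here (27 neutral plus 27 surprised originals), one block comprises 108 experimental trials plus 24 repetition trials.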

Data analysis

EEG data were processed with EEGLAB (Delorme and Makeig, 2004). The recordings were subjected to low- and high-pass filtering with a half-power cut-off at 30 and 0.1 Hz, respectively. The transition band was 7.5 Hz for the low pass filter (−6 dB/octave; 221 pts) and 0.1 Hz for the high pass filter (−6 dB/octave; 16 501 pts). The continuous data were epoched and baseline-corrected using a 200 ms pre-stimulus baseline and a 1000 ms post-stimulus window. The resulting epochs were visually scanned for non-typical artifacts caused by drifts or muscle movements. Epochs containing such artifacts were removed. Infomax, an independent component analysis algorithm, was applied to the remaining data, and components reflecting typical artifacts (i.e. horizontal and vertical eye movements and eye blinks) were removed. Back-projected single trials were again screened visually for residual artifacts and ERPs were derived by averaging individual epochs for each condition and participant including only non-repeated trials on which participants correctly withheld a response. A minimum of 62 and an average of 75 trials per condition entered statistical analysis. We identified the latency ranges of target ERP components based on visual inspection and prior work (see Supplementary Materials for a figure of all electrode traces). Mean voltages from within these ranges were subjected to an ANOVA with ‘Voiceness’ (vocal, non-vocal), ‘Emotion’ (surprised, neutral), ‘Hemisphere’ (left, right) and ‘Region’ (anterior, central, posterior) as repeated measures factors and ‘Sex’ as the between subjects factor. The factors ‘Hemisphere’ and ‘Region’ comprised average voltages computed across the following subgroups of electrodes: anterior left, Fp1, AF7, AF3, F5, F3, F1; anterior right, Fp2, AF8, AF4, F6, F4, F2; central left, FC3, FC1, C3, C1, CP3, CP1; central right, FC4, FC2, C4, C2, CP4, CP2; posterior left, P5, P3, P1, PO7, PO3, O1; posterior right, P6, P4, P2, PO8, PO4, O2. 
This selection of electrodes ensured that the tested subgroups contained an equal number of electrodes while providing broad scalp coverage that allowed the assessment of topographical effects. To facilitate the comparison of the present results with those of other labs, we included an analysis of data re-referenced to the average of all electrodes and an analysis re-referenced to the average of the left and right mastoids in the Supplementary Materials. We only report effects involving factors of interest (i.e. ‘Voiceness’, ‘Emotion’) and interactions for which follow-up analyses reached significance (P < 0.05) in the nose-referenced data set or in any of the other two data sets.
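The epoching and baseline-correction step described above can be sketched in plain numpy (in practice EEGLAB routines were used; the array names and shapes here are assumptions):

```python
import numpy as np

def epoch_and_baseline(eeg, onsets, sr=500, tmin=-0.2, tmax=1.0):
    """Cut fixed-length epochs around stimulus onsets and subtract
    the mean of the 200 ms pre-stimulus baseline, as in the
    analysis above.

    eeg:    (n_channels, n_samples) continuous recording
    onsets: stimulus-onset sample indices
    """
    pre, post = int(round(-tmin * sr)), int(round(tmax * sr))
    epochs = np.stack([eeg[:, o - pre:o + post] for o in onsets])
    baseline = epochs[:, :, :pre].mean(axis=2, keepdims=True)
    return epochs - baseline  # (n_epochs, n_channels, pre + post)
```

Averaging the baseline-corrected epochs per condition and participant then yields the ERPs entered into the ANOVA.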

Results

Behavioral results

We computed d-prime sensitivity scores by subtracting the normalized probability of false alarms (i.e. button presses to non-repeated sounds) from the normalized probability of hits (i.e. button presses to repeated sounds). The resulting scores were subjected to an ANOVA with ‘Voiceness’ (vocal, non-vocal) and ‘Emotion’ (neutral, surprised) as repeated measures factors and ‘Sex’ as a between subjects factor. A significant effect of ‘Voiceness’ [F(1,30) = 23.2, P < 0.0001, η2G = 0.132] indicated that participants were more sensitive to vocal than to non-vocal repetitions (Figure 1). Hit reaction times were analyzed using a comparable statistical model. This revealed effects of ‘Voiceness’ [F(1,30) = 21.3, P < 0.0001, η2G = 0.036] and ‘Emotion’ [F(1,30) = 10.9, P < 0.01, η2G = 0.017]. Participants responded faster to vocal and surprised sounds as compared with non-vocal and neutral sounds. All other effects were non-significant (Ps > 0.1).
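The d-prime score is the difference of the z-transformed hit and false-alarm rates. A minimal stdlib sketch follows; the log-linear correction used to avoid infinite z-scores at rates of 0 or 1 is an assumption, not necessarily the correction applied in the study.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate).

    Rates are computed with a log-linear correction (add 0.5 to each
    count numerator, 1 to each denominator) so z stays finite for
    perfect or empty cells."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

For example, a listener with 20 hits, 4 misses, 2 false alarms and 100 correct rejections receives a clearly positive d'; equal hit and false-alarm rates yield d' = 0.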
Fig. 1

Experimental task performance. The top row illustrates the sensitivity with which female (left) and male (right) listeners discriminated between repeated and non-repeated sounds. The bottom row illustrates the speed with which female (left) and male (right) listeners pushed the button to sound repetitions.


Electrophysiological results

The N1 was explored between 80 and 120 ms following stimulus onset. Mean amplitudes derived by averaging data points from within this time range were subjected to an ANOVA with ‘Emotion’, ‘Voiceness’, ‘Hemisphere’ and ‘Region’ as repeated measures factors and ‘Sex’ as a between subjects factor. A main effect of ‘Voiceness’ [F(1,30) = 17.47, P < 0.001, η2G = 0.017] indicated that N1 amplitudes were larger in the vocal than the non-vocal condition. Interactions of ‘Voiceness’ and ‘Hemisphere’ [F(1,30) = 4.91, P < 0.05, η2G = 0.0003] and ‘Voiceness’ and ‘Region’ [F(2,60) = 5.39, P < 0.05, η2G = 0.001] showed that this effect differed across the scalp. Exploring the ‘Voiceness’ effect for each level of ‘Hemisphere’ revealed greater effects at right [F(1,30) = 22.22, P < 0.001, η2G = 0.021] as compared with left recording sites [F(1,30) = 12.97, P < 0.01, η2G = 0.013]. Exploring the ‘Voiceness’ effect for each level of ‘Region’ pointed to greater effects over central [F(1,30) = 22.45, P < 0.0001, η2G = 0.025] as opposed to anterior [F(1,30) = 6.52, P < 0.05, η2G = 0.001] and posterior recording sites [F(1,30) = 17.02, P < 0.001, η2G = 0.016; Figures 2 and 3]. The factor ‘Emotion’ was significant in an interaction with ‘Region’ [F(2,60) = 6.03, P < 0.01, η2G = 0.001] indicating that the N1 tended to be larger for neutral than for surprised voices over anterior [F(1,30) = 3.47, P = 0.072, η2G = 0.004] but not central and posterior regions (Ps > 0.1).
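Mean-amplitude extraction of this kind reduces each ERP trace to one value per component window before the ANOVA. A minimal sketch, with the sampling rate and epoch start taken from the Methods and all names assumed:

```python
import numpy as np

def window_mean(erp, sr=500, tmin_ms=80, tmax_ms=120, epoch_start_ms=-200):
    """Average voltage in a post-onset latency window (defaults match
    the 80-120 ms N1 window analyzed here).

    erp: array with time as the last axis, e.g. (n_channels, n_samples)
    for an epoch running from epoch_start_ms onward."""
    start = int(round((tmin_ms - epoch_start_ms) * sr / 1000))
    stop = int(round((tmax_ms - epoch_start_ms) * sr / 1000))
    return erp[..., start:stop].mean(axis=-1)
```

The same function with tmin_ms/tmax_ms set to 150/350, 350/500 and 500/800 would yield the P2/P3, N4-like and LPP measures analyzed below.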
Fig. 2

ERP traces. Illustrated are average voltages recorded to the four sound conditions for female (left) and male (right) participants.

Fig. 3

ERP maps. Topographical maps illustrate the average differences between emotional and neutral as well as vocal and non-vocal sounds for female (top) and male (bottom) participants. Average differences were computed for the four statistical analysis windows capturing the N1, the P2/P3 complex, the N4-like negativity, and the LPP.

The P2 was immediately followed by another positivity and effects for both seemed comparable. Hence, we examined their mean voltages jointly between 150 and 350 ms following stimulus onset. Statistical analysis produced a ‘Voiceness’ main effect [F(1,30) = 36.26, P < 0.0001, η2G = 0.082] indicating greater positivity for vocal than for non-vocal sounds. An interaction of ‘Voiceness’, ‘Region’ and ‘Sex’ [F(2,60) = 3.12, P = 0.05, η2G = 0.0016] was pursued for women and men separately. In both groups, the ‘Voiceness’ by ‘Region’ interaction [females, F(2,30) = 47.08, P < 0.0001, η2G = 0.033; males, F(2,30) = 15.52, P < 0.0001, η2G = 0.023] revealed that the ‘Voiceness’ effect was maximal over anterior [females, F(1,15) = 74.33, P < 0.0001, η2G = 0.302; males, F(1,15) = 23.36, P < 0.001, η2G = 0.133], small over central [females, F(1,15) = 33.75, P < 0.0001, η2G = 0.176; males, F(1,15) = 18.39, P < 0.001, η2G = 0.133] and non-significant over posterior regions (Ps > 0.1). Notably, however, the ‘Voiceness’ effect was considerably greater in women than in men. The P2/P3 complex was also characterized by an interaction of ‘Emotion’ and ‘Region’ [F(2,60) = 7.41, P < 0.01, η2G = 0.0023]. Follow-up analyses showed that surprised sounds elicited greater positivity than neutral sounds over anterior [F(1,30) = 7.33, P < 0.05, η2G = 0.0115] and central [F(1,30) = 5.85, P < 0.05, η2G = 0.0072] but not posterior regions (P > 0.1). The P2/P3 complex was followed by an N4-like negativity. 
An exploration of mean voltages between 350 and 500 ms revealed effects of ‘Voiceness’ [F(1,30) = 5.17, P < 0.05, η2G = 0.006] and ‘Emotion’ [F(1,30) = 5.06, P < 0.05, η2G = 0.007] indexing greater amplitudes for non-vocal than for vocal and for neutral than for surprised stimuli. The ‘Voiceness’ effect was further qualified by interactions involving ‘Voiceness’ and ‘Region’ [F(2,60) = 19.72, P < 0.0001, η2G = 0.014] and ‘Voiceness’, ‘Region’, ‘Hemisphere’ and ‘Sex’ [F(2,60) = 6.5, P < 0.01, η2G = 0.0002]. In women, the interaction of ‘Voiceness’, ‘Region’ and ‘Hemisphere’ [F(2,30) = 3.27, P = 0.052, η2G = 0.0002] was pursued for each level of ‘Region’. Over anterior and central recording sites, the ‘Voiceness’ by ‘Hemisphere’ interaction [anterior, F(1,15) = 2.24, P = 0.15, η2G = 0.001; central, F(1,15) = 3.33, P = 0.087, η2G = 0.0007] was non-significant or marginal but the ‘Voiceness’ effect was significant [anterior, F(1,15) = 41.37, P < 0.0001, η2G = 0.151; central, F(1,15) = 7.45, P < 0.05, η2G = 0.022]. There were no effects over posterior recording sites (Ps > 0.1). In men, the interaction of ‘Voiceness’, ‘Region’ and ‘Hemisphere’ [F(2,30) = 6.15, P < 0.01, η2G = 0.0002] was significant also. However, follow-up analyses revealed only a ‘Voiceness’ main effect over anterior regions [F(1,15) = 6.99, P < 0.05, η2G = 0.026]. All other effects were non-significant (Ps > 0.1). The ‘Emotion’ effect was qualified by an interaction of ‘Emotion’ and ‘Region’ [F(2,60) = 3.68, P = 0.05, η2G = 0.001] and an interaction of ‘Emotion’, ‘Hemisphere’ and ‘Sex’ [F(1,30) = 5.65, P < 0.05, η2G = 0.0002]. Across participants, the ‘Emotion’ effect was present over anterior [F(1,30) = 5.39, P < 0.05, η2G = 0.015], central [F(1,30) = 7.88, P < 0.01, η2G = 0.014] but not posterior regions (P > 0.1). In female participants, the ‘Emotion’ effect was independent of ‘Hemisphere’ (P > 0.1). 
In male participants, the ‘Emotion’ effect interacted with ‘Hemisphere’ [F(1,15) = 5.43, P < 0.05, η2G = 0.0004] in that it was significant over right [F(1,15) = 4.9, P < 0.05, η2G = 0.019] but not left (P > 0.1) recording sites. The LPP peaked between 500 and 800 ms following stimulus onset. Analysis of mean voltages within this time window revealed an ‘Emotion’ effect [F(1,30) = 9.45, P < 0.01, η2G = 0.014] with greater amplitudes for surprised than neutral stimuli. The ‘Emotion’ effect was qualified by interactions of ‘Emotion’, ‘Hemisphere’ and ‘Sex’ [F(1,30) = 8.43, P < 0.01, η2G = 0.0003] and ‘Emotion’, ‘Voiceness’ and ‘Hemisphere’ [F(1,30) = 5.38, P < 0.01, η2G = 0.0002]. Exploring the first interaction revealed again that the ‘Emotion’ effect differed by ‘Hemisphere’ in male [F(1,15) = 11.15, P < 0.01, η2G = 0.0005] but not female participants (P > 0.1). In men, but not in women, the effect was greater over right [F(1,15) = 7.39, P < 0.05, η2G = 0.04] as compared with left hemisphere leads [F(1,15) = 4.92, P < 0.05, η2G = 0.023]. Exploring the second interaction revealed an ‘Emotion’ by ‘Voiceness’ interaction in the left [F(1,30) = 4.27, P = 0.05, η2G = 0.004] but not the right hemisphere (P > 0.1). Over the left hemisphere, vocal [F(1,30) = 8.73, P < 0.01, η2G = 0.028] but not non-vocal sounds (P > 0.1) elicited a greater positivity for surprised as compared with neutral expressions. Although the ‘Voiceness’ effect was non-significant, there was an interaction of ‘Voiceness’, ‘Region’ and ‘Sex’ [F(2,60) = 8.15, P < 0.001, η2G = 0.003] for which follow-up comparisons were significant in both women [F(2,30) = 28.74, P < 0.0001, η2G = 0.023] and men [F(2,30) = 5.48, P < 0.01, η2G = 0.005]. In women, vocal sounds elicited a greater amplitude than non-vocal sounds over anterior electrodes [F(2,30) = 19.56, P < 0.001, η2G = 0.07]. 
The effect was non-significant over central electrodes (P > 0.1) and reversed polarity over posterior electrodes [F(2,30) = 8.83, P < 0.01, η2G = 0.019]. In men, the ‘Voiceness’ effect approached significance over anterior regions only [F(1,15) = 4.03, P = 0.063, η2G = 0.008]. Statistical analysis demonstrated that fronto-central ‘Voiceness’ but not ‘Emotion’ effects for P2/P3, N4-like, and LPP components reversed polarity over the mastoids (see Supplementary Materials).

Discussion

In this study, we presented participants with vocal and non-vocal sounds of emotional and neutral quality in order to shed light on the temporal course underpinning vocal-emotional processing. We found that voiceness modulated the amplitude of a negativity peaking 100 ms following stimulus onset. This negativity belongs to the N1 family, which is modulated by attention in a top-down or bottom-up manner. Attended stimuli or stimuli that capture attention endogenously elicit a greater N1 than less attended or unattended stimuli (Woldorff and Hillyard, 1991; Escoffier ). The present N1 modulation was largest over right and centro-parietal electrodes. It was hence compatible with sources in higher-order auditory regions like the temporal voice areas. Moreover, it agrees with other evidence suggesting that the right hemisphere is more relevant than the left for social processing (for a review see Brauer ). The early voiceness effect identified here aligns with visual work showing P1 differences between images with and without people (Proverbio ) and implicating the N170, which, although later than the present N1, belongs to the N1 family (Bentin ). In contrast, the present results diverge from prior auditory work reporting voiceness modulations at 200 ms following stimulus onset. Different factors may be responsible for this discrepancy. For example, we presented 162 vocal and 162 non-vocal sounds to 32 participants and were thus better powered than most previous work (Levy ; De Lucia ; Capilla ). Additionally, our high-pass filter settings were lower (Levy ; Charest ) and thus more appropriate for the examination of early/fast signal aspects. Last, we compared voices with their spectral rotations whereas other studies implemented other comparisons (e.g. non-human animal vocalizations, music, non-living objects) with other strengths and weaknesses as concerns the control of acoustic and conceptual confounds. 
Hence, we cannot rule out that the present N1 effects were caused by acoustic and/or conceptual confounds inherent in the present design. The N1 was succeeded by two positivities named P2 and P3 reflecting stimulus perception and categorization (Johnson and Donchin, 1978; Woldorff and Hillyard, 1991; Schirmer ; Schirmer ). As predicted, their voltages were more positive for vocal as compared with non-vocal sounds and this difference extended into the remainder of the ERP epoch affecting subsequent components. Notably, there was not only a change in polarity but also in topography relative to the N1 effect. Specifically, the P2/P3 effect showed bilaterally with a maximum at fronto-central electrodes and reversed polarity over the mastoids. Although the ERP inverse problem means that a given scalp topography can be explained by more than one underlying source pattern, the scalp topography observed here is typically linked to a contribution of auditory cortex (Näätänen ). One may speculate that the present P2/P3 effect relates to the FTPV (Charest ; Capilla ). Both occur within a similar time range over fronto-central electrodes. Moreover, differences over the posterior scalp where the FTPV but not the present P2/P3 reverses polarity are easily explained by differences in the reference electrode. Unlike a single channel reference, the average reference used previously forces the ERP into a dipolar pattern (i.e. potentials across channels sum to 0). This is evident from an analysis of the present data with an average reference, which revealed a pattern akin to that of the FTPV (see Supplementary Materials). Nevertheless, we refrain from using FTPV terminology because we have no clear evidence that the underlying mechanisms are indeed voice-specific. Given their similarity to other kinds of sound processing (De Lucia ; Schirmer ), they likely have a more general nature. Emotion effects emerged simultaneously with voiceness effects. 
However, in the main (nose-referenced) analysis they were only marginal for the N1 and significant in the P2/P3 complex beginning 150 ms following sound onset. Although initially emotion effects were similar to those of voiceness, they differed in that they failed to reverse polarity over the mastoids. Moreover, they occurred independently of voiceness processing. The P2/P3 complex was not greater for emotional sounds in the vocal than the non-vocal condition or for vocal sounds in the emotional than the neutral condition. Such super-additivity appeared only later in the epoch, for the LPP, and after the emotion effect gained significance over central electrodes differentiating more clearly from the voiceness effect. The late interaction of emotion and voiceness diverges from visual evidence (Proverbio ). Additionally, it seems at odds with prior auditory evidence for vocal content interacting with verbal content in the N4 (Schirmer and Kotz, 2003). However, ours is the first study to tackle the confluence of vocal and emotion processing and provides a strong test of interactive effects. Specifically, the very nature of our non-vocal sounds should have promoted rather than hampered the interaction of voiceness and emotion during stimulus processing. Although non-vocal sounds retained emotion aspects of their originals, their emotion was recognized more poorly and arousal differences between surprise and neutral stimuli were perceived as weaker (see Supplementary Materials). This should have hampered emotional processing for non-vocal relative to vocal sounds, which in turn should have produced a voiceness by emotion interaction. That this interaction was absent before the LPP provides convincing support that voiceness and emotion are treated largely independently before being integrated at a later and more controlled processing stage. The LPP appears to reflect this integration. 
Its amplitude over the left hemisphere was greater for emotional as compared with neutral vocal but not non-vocal sounds. As apparent in Figure 2, emotional voices differed from all other sounds suggesting that they won the competition for resources. This effect compares to previous reports of the LPP being larger for emotional as compared with neutral stimuli and may reflect emotional strength or simply attention allocation as a function of stimulus significance (Moser ; Foti and Hajcak, 2008; Schirmer ). That the voice–emotion interaction was significant in the left hemisphere only accords with fMRI evidence for an involvement of left inferior frontal gyrus for vocal emotions (Kotz ; Warren ; Leitman ; Frühholz ) and may be voice-specific (Schirmer and Adolphs, in press). The evidence discussed thus far highlights temporally and morphologically distinct effects that could map onto different processing stages. In line with proposals made recently (De Lucia ; Perrodin ), a first stage may involve basic level processing that discriminates living from non-living sources and that may occur around 100 ms following stimulus onset. Although our N1 results concord with this, a direct mapping can only be tentative as acoustic (e.g. the harmonics-to-noise ratio) and conceptual factors (e.g. sound familiarity) offer alternative explanations. A second stage could entail subordinate level processing further specifying a sound’s source (e.g. human vs non-human animal). Presumably this begins around 150 ms (De Lucia ) and thus overlaps with the P2/P3 results obtained here. Because emotions or affect are inherent to humans and animals alike, their representations may emerge early in the course of basic level processing. Nevertheless, they are not immediately integrated as is evident from the fact that interaction effects were non-significant for both the N1 and the P2/P3 complex. 
Emotion modulated voiceness effects only later, for the LPP, pointing to a possible third stage during which the different sound properties merge into a holistic sound object and processing prioritizes some objects over others (Figure 4).
Fig. 4

Interpretative framework. Voice processing is illustrated by example. When presented with a vocal sound we may first represent its animacy and basic affect, before accessing subordinate level sound categories. Subsequently, these separate representations may be integrated into a holistic percept.

Extant research suggests that women engage in preferential social processing more readily than men do (Schirmer ,b; Proverbio ; Proverbio and Galli, 2016). The present results corroborate this. The voiceness positivity effect at 200 ms following stimulus onset was significantly greater in women than in men. Moreover, it lasted well into the LPP, where it changed into a different pattern over posterior electrodes. This posterior LPP effect was characterized by a greater positivity for non-vocal relative to vocal sounds. Within the interpretational framework outlined above, these findings suggest that initial basic level processing is comparable in men and women. Differences appear only for subsequent, putatively subordinate level processing. Perhaps, after having identified a sound source as living, women direct more resources than men at further specifying the sound. Moreover, the posterior LPP effect might reflect additional top-down processes directed at inferring some sort of animacy from the non-vocal sounds, a process known to differ between the sexes (Proverbio and Galli, 2016). Although this study provides novel insights into vocal-emotional processing, it also raises questions for future research. For example, we compared vocal with spectrally rotated sounds, thus controlling for some but certainly not all acoustic stimulus differences. Moreover, whereas the vocal sounds were highly familiar and natural, their rotated counterparts were not. Thus, it is important for future research to compare human voices with other controls such as animal vocalizations (Fecteau ), music (Escoffier ), or environmental noises (Belin ; De Lucia ).
Consistent results across these conditions would help rule out the confounds that necessarily arise for each control individually. Another question that should be tackled is why, in this study, women were not more emotionally sensitive than men. Instead, the sexes differed simply in the laterality of emotion effects. In men but not women, the N4-like negativity and the LPP effects were larger at right than at left electrodes. Possibly, female voiceness sensitivity was so strong as to override any preferential attention to emotion. Future research could test this possibility by presenting vocal and non-vocal sounds in separate blocks. Moreover, new studies could employ different neuroimaging techniques so as to characterize the underlying spatial sources. A candidate here is functional near-infrared spectroscopy, which yields both high spatial and high temporal resolution in the cortex (Tse and Penney, 2008; Tse ). Despite these open questions, however, this work allows for some conclusions. Specifically, our findings show that listeners, especially women, direct more processing resources to vocal than to non-vocal sounds even when voiceness is task-irrelevant. This effect unfolds from 100 ms following sound onset and is characterized by distinct processing stages, potentially reflecting basic level categorization, subordinate level categorization, and the integration of subordinate-level information. Voices, when compared against their spectral rotations, produce effects that are larger than but temporally overlapping with emotion effects. Moreover, both are independent before interacting in a manner that enhances the processing of vocal-emotional over vocal-neutral and non-vocal sounds. Taken together, our findings underline the human bias towards conspecifics and the social nature of the human brain.

Supplementary data

Supplementary data are available at SCAN online.

Conflict of interest: None declared.
References (10 of 53 shown)

1. Foti D, Hajcak G (2008). Deconstructing reappraisal: descriptions preceding arousing pictures modulate the subsequent neural response. J Cogn Neurosci.

2. Paulmann S, Kotz SA (2008). Early emotional prosody perception based on different speaker voices. Neuroreport.

3. Ethofer T, Wiethoff S, Anders S, Kreifelts B, Grodd W, Wildgruber D (2007). The voices of seduction: cross-gender effects in processing of erotic prosody. Soc Cogn Affect Neurosci.

4. Schirmer A, Chen C-B, Ching A, Tan L, Hong RY (2013). Vocal emotions influence verbal memory: neural correlates and interindividual differences. Cogn Affect Behav Neurosci.

5. Schirmer A, Meck WH, Penney TB (2016). The Socio-Temporal Brain: Connecting People in Time. Trends Cogn Sci. [Review]

6. Johnson R, Donchin E (1978). On how P300 amplitude varies with the utility of the eliciting stimuli. Electroencephalogr Clin Neurophysiol.

7. Escoffier N, Herrmann CS, Schirmer A (2015). Auditory rhythms entrain visual processes in the human brain: evidence from evoked oscillations and event-related potentials. Neuroimage.

8. Schirmer A, Teh KS, Wang S, Vijayakumar R, Ching A, Nithianantham D, Escoffier N, Cheok AD (2010). Squeeze me, but don't tease me: human and mechanical touch enhance visual attention and emotion discrimination. Soc Neurosci.

9. Schirmer A, Simpson E, Escoffier N (2007). Listen up! Processing of intensity change differs for vocal and nonvocal sounds. Brain Res.

10. Proverbio AM, Zani A, Adorni R (2008). Neural markers of a greater female responsiveness to social stimuli. BMC Neurosci.
