Andrey Anikin, Katarzyna Pisanski, David Reby.
Abstract
When producing intimidating aggressive vocalizations, humans and other animals often extend their vocal tracts to lower their voice resonance frequencies (formants) and thus sound big. Is acoustic size exaggeration more effective when the vocal tract is extended before, or during, the vocalization, and how do listeners interpret within-call changes in apparent vocal tract length? We compared perceptual effects of static and dynamic formant scaling in aggressive human speech and nonverbal vocalizations. Acoustic manipulations corresponded to elongating or shortening the vocal tract either around (Experiment 1) or from (Experiment 2) its resting position. Gradual formant scaling that preserved average frequencies conveyed the impression of smaller size and greater aggression, regardless of the direction of change. Vocal tract shortening from the original length conveyed smaller size and less aggression, whereas vocal tract elongation conveyed larger size and more aggression, and these effects were stronger for static than for dynamic scaling. Listeners familiarized with the speaker's natural voice were less often 'fooled' by formant manipulations when judging speaker size, but paid more attention to formants when judging aggressive intent. Thus, within-call vocal tract scaling conveys emotion, but a better way to sound large and intimidating is to keep the vocal tract consistently extended.
Keywords: acoustic communication; body size; dynamic; formants; vocal tract length
Year: 2022 PMID: 35242348 PMCID: PMC8753157 DOI: 10.1098/rsos.211496
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 3.653
Figure 1. An illustration of formant manipulations in Experiment 1. (a) A spectrogram of the original English utterance with the approximate contours of the first four formants traced with red lines. Notice the downward trajectory of formants F3 and F4, suggesting some natural vocal tract elongation in the original, unmanipulated recording. (b) Formant shifts per condition, relative to the original formant frequencies. (c) The same utterance with formants shifted by 2.5 semitones, or approximately 15.5%. (d) Spectrograms of a synthetic roar with formants shifted by 2.5 semitones. Notice the flat original formant contours and clean S-curve transitions in the roar. All spectrograms have a frequency range of 0 to 5 kHz.
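The correspondence between semitones and percentage shifts quoted in the Figure 1 caption can be checked with a short sketch. This is not the authors' code. The function names are hypothetical, and the conversion to apparent vocal tract length (VTL) assumes a uniform-tube model in which formant frequencies are inversely proportional to tract length.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift in semitones (12 semitones = 1 octave)."""
    return 2.0 ** (semitones / 12.0)


def apparent_vtl_ratio(semitones):
    """Apparent vocal tract length ratio implied by a formant shift.

    Assumption: uniform-tube model, so formant frequencies are inversely
    proportional to vocal tract length.
    """
    return 1.0 / semitones_to_ratio(semitones)


if __name__ == "__main__":
    # Manipulation strengths used in Experiment 1 (see the Figure 2 caption).
    for st in (1.5, 2.0, 2.5):
        formant_pct = (semitones_to_ratio(st) - 1.0) * 100.0
        vtl_pct = (apparent_vtl_ratio(st) - 1.0) * 100.0
        print(f"+{st} semitones: formants {formant_pct:+.1f}%, "
              f"apparent VTL {vtl_pct:+.1f}%")
```

For an upward shift of 2.5 semitones the ratio is about 1.155, which matches the caption's figure of approximately 15.5%. Under the uniform-tube assumption, the same shift corresponds to an apparent VTL roughly 13.4% shorter.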
Figure 2. The effects of formant scaling for each manipulation strength (1.5, 2.0 or 2.5 semitones) relative to unmanipulated stimuli (0 semitones): medians of posterior distributions and 95% CIs. (a) English speech, (b) Persian speech, (c) synthetic roars and (d) average across English/Persian/roars.
Figure 3. Design and results of Experiment 2. (a) An example of vocal stimulus. A baseline of a few seconds of neutral speech is followed by a post-stimulus, an aggressive utterance by the same speaker. (b) Post-stimulus manipulation: the initial apparent VTL is adjusted to match the average apparent VTL of baseline, and the general trend for changing apparent VTL in the post-stimulus is removed. In the example shown here, this results in lowering formant frequencies by about 0.4 semitones initially and 0.9 semitones towards the end of the post-stimulus. The ‘flattened’ post-stimulus is further manipulated to shift formant frequencies by ±2 semitones either at once (high/low conditions) or gradually, in an S-curve starting from the neutral value (rising/falling conditions). (c) The effect of dynamic formant manipulations in the post-stimulus compared to the flat condition: medians of posterior distributions and 95% CIs. Violin plots show the distribution of fitted values per prototype sound (N = 46).
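The five post-stimulus conditions described in the Figure 3 caption (flat, high, low, rising, falling) can be illustrated as formant-shift contours over normalized time. The sketch below is illustrative only. The logistic shape, its steepness, the function names, and the sign convention (high and rising mean formants shifted upward) are all assumptions, since the record does not specify the exact curve used.

```python
import math


def s_curve(t, steepness=10.0):
    """Logistic S-curve rescaled to run exactly from 0 at t=0 to 1 at t=1.

    Assumption: the logistic form and steepness are illustrative choices.
    """
    def raw(x):
        return 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))

    lo, hi = raw(0.0), raw(1.0)
    return (raw(t) - lo) / (hi - lo)


def shift_contour(condition, n_steps=5, max_shift=2.0):
    """Formant shift in semitones at n_steps evenly spaced time points.

    Static conditions apply the full shift at once; dynamic conditions start
    at the neutral value (0 semitones) and follow an S-curve to the full shift.
    """
    times = [i / (n_steps - 1) for i in range(n_steps)]
    if condition == "flat":
        return [0.0 for _ in times]
    if condition == "high":
        return [max_shift for _ in times]
    if condition == "low":
        return [-max_shift for _ in times]
    if condition == "rising":
        return [max_shift * s_curve(t) for t in times]
    if condition == "falling":
        return [-max_shift * s_curve(t) for t in times]
    raise ValueError(f"unknown condition: {condition}")


if __name__ == "__main__":
    for cond in ("flat", "high", "low", "rising", "falling"):
        contour = ", ".join(f"{s:+.2f}" for s in shift_contour(cond))
        print(f"{cond:>8}: {contour}")
```

In this sketch the static conditions hold the ±2 semitone shift throughout, while the dynamic conditions reach it only at the end of the post-stimulus, so their time-averaged shift is smaller.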