
Acoustic information about upper limb movement in voicing.

Wim Pouw1,2,3, Alexandra Paxton4,5, Steven J Harrison4,6, James A Dixon4,5.   

Abstract

We show that the human voice has complex acoustic qualities that are directly coupled to peripheral musculoskeletal tensioning of the body, such as subtle wrist movements. In this study, human vocalizers produced a steady-state vocalization while rhythmically moving the wrist or the arm at different tempos. Although listeners could only hear and not see the vocalizer, they were able to completely synchronize their own rhythmic wrist or arm movement with the movement of the vocalizer which they perceived in the voice acoustics. This study corroborates recent evidence suggesting that the human voice is constrained by bodily tensioning affecting the respiratory-vocal system. The current results show that the human voice contains a bodily imprint that is directly informative for the interpersonal perception of another's dynamic physical states.
Copyright © 2020 the Author(s). Published by PNAS.

Keywords:  hand gesture; interpersonal synchrony; motion tracking; vocalization acoustics

Year:  2020        PMID: 32393618      PMCID: PMC7260986          DOI: 10.1073/pnas.2004163117

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


Human speech is a marvelously rich acoustic signal, carrying communicatively meaningful information on multiple levels and timescales (1–4). Human vocal ability is held to be much more advanced than that of our closest living primate relatives (5). Yet despite all its richness and dexterity, human speech is often complemented with hand movements known as co-speech gestures (6). Current theories hold that co-speech gestures occur because they visually enhance speech by depicting or pointing to communicative referents (7, 8). However, speakers do not gesture merely to visually enrich speech: Humans gesture on the phone when their interlocutor cannot see them (9), and congenitally blind children even gesture to one another in ways indistinguishable from gestures produced by sighted persons (10). Co-speech gestures, no matter what they depict, also closely coordinate with the melodic aspects of speech known as prosody (11). Specifically, gesture’s salient expressions (e.g., sudden increases in acceleration or deceleration) tend to align with moments of emphasis in speech (12–17). Recent computational models trained on associations between an individual’s gestures and speech acoustics have succeeded in producing very natural-looking synthetic gestures from novel speech acoustics of that same individual (18), suggesting a very tight (but person-specific) relation between prosodic–acoustic information in speech and gestural movement. Such research dovetails with the remarkable finding that speakers in conversation who cannot see and only hear each other tend to synchronize their postural sway (i.e., the slight, nearly imperceptible movement needed to keep a person upright) (19, 20).
Recent research suggests that there might indeed be a fundamental link between body movements and speech acoustics: Vocalizations were found to be acoustically patterned by peripheral upper limb movements due to these movements also affecting tensioning of respiratory-related muscles that modulate vocal acoustics (21). This suggests that the human voice has a further complexity to it, carrying information about movements (i.e., tensioning) of the musculoskeletal system. In the current study we investigate whether listeners are able to perceive upper limb movement information in human voicing.

Methods and Materials

To assess whether listeners can detect movement from vocal acoustics, we tested whether listeners could synchronize their arm or wrist movements by listening to vocalizers who had been instructed to move their arm or wrist at different tempos. We first collected naturalistic data from six prestudy participants (vocalizers; three cisgender males and three cisgender females) who phonated the vowel /ə/ (as in cinema) in one breath while moving the wrist or arm in rhythmic fashion at different tempos (slow vs. medium vs. fast). Prestudy participants were asked to keep their vocal output as stable and monotonic as possible while moving their upper limbs. Movement tempo feedback was provided by a green bar that visually represented the duration of the participant’s immediately prior movement cycle (as measured through the motion-tracking system) relative to that specified by the target tempo (Fig. 1). Participants were asked to keep the bar within a particular region (i.e., within 10% of the target tempo). The green bar thus provided information about the immediately previous movement tempo relative to the prescribed tempo without the visual representation itself moving at that tempo. It is important to note that vocalizers were therefore not exposed to an external rhythmic signal, such as a (visual) metronome. Note, further, that we found in an earlier study that when vocalizers move at their own preferred tempo—with no visual feedback about movement tempo—acoustic modulations tightly synchronized with movement cycles are also obtained (22). When participants vocalize without moving, however, such acoustic modulations are absent (21). As in previous research (21, 22), hand movements in the current study inadvertently affected the voice acoustics of these prestudy vocalizers (Fig. 1), thereby providing a possible information source for listeners in the main study.
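The tempo-feedback logic described above can be sketched as follows. This is a minimal illustration only; the function name, signature, and tolerance handling are our assumptions, not the authors' implementation:

```python
def tempo_feedback(cycle_duration_s, target_hz, tolerance=0.10):
    """Judge whether the immediately prior movement cycle fell within
    +/-10% of the target tempo (hypothetical sketch of the green-bar logic)."""
    target_period = 1.0 / target_hz  # e.g., medium tempo 1.33 Hz -> ~0.75 s per cycle
    deviation = (cycle_duration_s - target_period) / target_period
    return abs(deviation) <= tolerance  # True -> bar within the target region
```

On this sketch, a cycle lasting 1.2 s against the fast (1.6 Hz) target would fall outside the region, while a cycle matching the medium-tempo period would fall inside it.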
Fig. 1.

Vocalizer movements (A) and resultant acoustic patterning caused by movement (B). (A) Six vocalizers moved their wrist and arm in rhythmic fashion at different tempos (slow = 1.06 Hz; medium = 1.33 Hz; fast = 1.6 Hz), guided via a green bar digitally connected to a motion-tracking system, which represented their movement frequency relative to the target tempo. Human postures modified from ref. 23. (B) The resultant movement and acoustic data were collected. Preanalysis showed that acoustics were indeed affected by movement, with sharp peaks in the fundamental frequency (perceived as pitch; C) and in the smoothed amplitude envelope of the vocalization (in purple, B) when movements reached peaks in deceleration during the stopping motion at maximum extension. Peaks in deceleration of the movement lead to counteracting muscular adjustments throughout the body, recruited to maintain postural integrity, which also cascade into vocalization acoustics. (D) Here we assessed how the fundamental frequency of voicing (in the human range: 75 to 450 Hz) was modulated around the maximum extension for each vocalizer and combined for all vocalizers (red line). D shows that smoothed-average-normalized F0 (also linearly detrended and z-scaled per vocalization trial) peaked around the moment of the maximum extension, when a sudden deceleration and acceleration occurred; normalized F0 dipped at steady-state, low-physical-impetus moments of the movement phase (when velocity was constant), rising again at maximum flexion (∼300 to 375 ms before and after the maximum extension), replicating previous work (21, 24). Vocalizer wrist movements showed a less pronounced F0 modulation than vocalizer arm movement trials. For individual vocalizer differences for each tempo condition, see our interactive graph provided in the supporting information.

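The per-trial F0 normalization described in the Fig. 1 caption (linear detrending followed by z-scaling) can be sketched as below. This is a minimal reconstruction of the stated preprocessing steps, not the authors' analysis script:

```python
import numpy as np

def normalize_f0(f0):
    """Linearly detrend an F0 trace and z-scale it (per vocalization trial)."""
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(f0.size)
    slope, intercept = np.polyfit(t, f0, 1)   # least-squares linear trend
    detrended = f0 - (slope * t + intercept)  # remove slow pitch drift across the trial
    return (detrended - detrended.mean()) / detrended.std()  # z-scale to mean 0, SD 1
```

After this step, F0 traces from different vocalizers and trials are on a common scale, so modulations around maximum extension can be averaged across trials.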
In the main study, 30 participants (listeners; 15 cisgender males and 15 cisgender females) were instructed to synchronize their own movements with the vocalizer’s wrist and arm movements while having access only to the vocalizations of these prestudy participants, presented over headphones (for detailed materials and methods, see the supporting information). Thirty-six vocalizations (6 different vocalizers × 3 tempos × 2 vocalizer wrist vs. arm movements) were presented twice to listeners: once when they were instructed to synchronize with the vocalizer using their own wrist movement and once using their own arm movement. If listeners can synchronize the tempo and phasing of their movements to those of the vocalizers, this would provide evidence that voice acoustics can inform about bodily tensioning states—even when the vocalizer has no explicit goal of interpersonal communication. The ethical review committee of the University of Connecticut approved this study (approval H18-260). All participants provided signed informed consent, and vocalizer prestudy participants also signed an audio release form.

Data and Materials Availability.

The hypotheses and methodology have been preregistered on the Open Science Framework (OSF; https://osf.io/ygbw5/). The data and analysis scripts supporting this study can be found on OSF (https://osf.io/9843h/).

Results

In keeping with our hypotheses, we found that listeners were able to detect and synchronize with movement from vocalizations (for these results, see Fig. 2; for detailed results, see the supporting information). Listeners reliably adjusted their wrist and arm movement tempo to the slow, medium, and fast tempos performed by the vocalizers. Furthermore, listeners’ circular means of the relative phases (Φ) were densely distributed around 0° (i.e., close to perfect synchrony), with an overall negative mean asynchrony of 45°, indicating that listeners slightly anticipated the vocalizers. Surprisingly—and against our original expectations—we found that this held even for the harder-to-detect vocalizer wrist movements. The variability of relative phase (as measured by circular SD Φ) was, however, slightly increased for wrist vs. arm vocalizations, with a 0.28 increase in circular SD Φ; this indicates that listeners had greater difficulty synchronizing in phase with the vocalizer’s wrist versus arm movements.
Fig. 2.

Synchrony results. The example shows different ways movements can synchronize between the listener and the vocalizer. Fully asynchronous movement would entail a mismatch of movement tempos and random variation of relative phases. Synchronization of phases may occur without exact matching of movement tempos. Full synchronization entails tempo matching and 0° relative phasing between vocalizer and listener movement. Main results show clear tempo synchronization, as the observed movement frequencies for each vocalization trial were well matched by the observed movement frequencies of listeners moving to that trial. Similarly, phase synchronization was clearly apparent, as the phasing distributions are all pronouncedly peaked rather than flat, with a negative mean asynchrony regardless of vocalizer movement or movement tempo. Individual differences in vocalizer F0 modulations for each vocalizer trial were modeled using a nonlinear regression method, generalized additive modeling (GAM), providing a model fit (R2 adjusted) for each trial that indicates how structured the normalized F0 modulations were around moments of the maximum extension (also see Fig. 1). The variance explained for each vocalizer trial was then regressed against the listeners’ average synchronization performance (average circular SD of relative phase, SD Φ) for that trial. More structured F0 modulations around the maximum extensions of upper limb movement (higher R2 adjusted) predicted better synchronization performance (lower SD Φ), r = −0.48, P < 0.003. This means that more reliable acoustic patterning in the vocalizer’s voicing predicts higher listener synchronization performance. Human postures modified from ref. 23.

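The trial-level relation reported in the Fig. 2 caption amounts to a Pearson correlation between per-trial model fit (R2 adjusted) and per-trial synchronization variability (SD Φ) across the 36 vocalization trials. A minimal sketch of that correlation, with illustrative data rather than the study's values:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation, e.g., between per-trial R^2 adjusted (F0-modulation
    structure) and per-trial circular SD of relative phase (sync variability)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))
```

A negative r, as in the reported r = −0.48, means trials with more structured F0 modulation (higher R2) showed lower phase variability (better synchronization).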

Discussion

We conclude that vocalizations carry information about the upper limb movements of the vocalizer, given that listeners can adjust and synchronize their movements through audition of the vocalization alone. Importantly, this tempo and phase synchronization was not an artifact of chance, since three different movement tempos were presented in random order. Nor are these effects reducible to idiosyncrasies of the vocalizers, as the patterns were observed across six different vocalizers with different vocal-acoustic qualities (e.g., cisgender male and female vocalizers). Further, vocalizers were not deliberately coupling vocal output with movement and were, if anything, likely trying to inhibit these effects, as they had been instructed to keep their vocal output as stable as possible. This type of bodily patterning of voice acoustics is thus a pervasive phenomenon that is difficult to counteract. The present findings enrich our understanding of the coupling between acoustic and motor domains by showing that information about bodily movement is present in the acoustics. Previous research has shown, for example, that the smoothed amplitude envelope of speech correlates closely with mouth-articulatory movements (25). Indeed, seeing (26) or even manually feeling (27) articulatory movements can resolve auditorily ambiguous sounds that have been artificially morphed by experimenters, leading listeners to hear a “pa” rather than a “da” depending on the visual or haptic information about the speaker’s lips. The current results add another member to the family of acoustic–motor couplings by showing both that the human voice contains acoustic signatures of hand movements and that human listeners are keenly sensitive to them. Hand gestural movements may thus have evolved as an embodied innovation for vocal control, much like other bodily constraints on the acoustic properties of human vocalization (27, 28).
It is well established that information about bodies carried in vocalizations is exploited in the wild by nonhuman species (29). For example, rhesus monkeys infer age-related body-size differences of conspecifics from the acoustic qualities of “coos” (30). Orangutans even actively exploit this relation: They cup their hands in front of their mouths when vocalizing, changing the sound quality, presumably so as to appear acoustically larger and more threatening (31). Humans, too, can predict with some success the upper body strength of male vocalizers (32), especially from roaring as opposed to, for example, screaming vocalizations (33). The current results add to this literature by showing that peripheral upper limb movements also imprint their presence on the human voice, providing an information source about dynamically changing bodily states. One implication of the current findings is that speech recognition systems may be improved by becoming sensitive to these acoustic–bodily relations. With the current results in hand, it thus becomes possible that the excitement we hear in a friend’s voice on the phone is, in part and at times, perceived through gesture-induced acoustics that directly convey bodily tension. Gestures, then, are not merely seen—they may be heard, too.
