Guillaume Lemaitre, Olivier Houix, Frédéric Voisin, Nicolas Misdariis, Patrick Susini.
Abstract
Imitative behaviors are widespread in humans, in particular whenever two persons communicate and interact. Several tokens of spoken languages (onomatopoeias, ideophones, and phonesthemes) also display different degrees of iconicity between the sound of a word and what it refers to. Thus, it probably comes as no surprise that human speakers use many imitative vocalizations and gestures when they communicate about sounds, as sounds are notably difficult to describe. What is more surprising is that vocal imitations of non-vocal everyday sounds (e.g. the sound of a car passing by) are in practice very effective: listeners identify sounds better with vocal imitations than with verbal descriptions, despite the fact that vocal imitations are inaccurate reproductions of a sound created by a particular mechanical system (e.g. a car driving by) through a different system (the vocal apparatus). The present study investigated the semantic representations evoked by vocal imitations of sounds by experimentally quantifying how well listeners could match sounds to category labels. The experiment used three different types of sounds: recordings of easily identifiable sounds (sounds of human actions and manufactured products), human vocal imitations, and computational "auditory sketches" (created by algorithmic computations). The results show that performance with the best vocal imitations was similar to that with the best auditory sketches for most categories of sounds, and in some cases even to that with the referent sounds themselves. More detailed analyses showed that the acoustic distance between a vocal imitation and a referent sound is not sufficient to account for such performance. Analyses suggested that instead of trying to reproduce the referent sound as accurately as vocally possible, vocal imitations focus on a few important features, which depend on each particular sound category. These results offer perspectives for understanding how human listeners store and access long-term sound representations, and set the stage for the development of human-computer interfaces based on vocalizations.
Year: 2016 PMID: 27992480 PMCID: PMC5161510 DOI: 10.1371/journal.pone.0168167
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The selection of basic mechanical interactions, classified into morphological profiles.
T = target category; D = distractor category. The descriptions in the last column were provided to the experimental participants (see Section 3). *Sounds with a strong tonal component.
| Morphological profile | Category | Description |
|---|---|---|
| Discrete, impulsive | Shooting (T) | Shooting, an explosion. |
| Discrete, impulsive | Hitting* (D) | Tapping on a board. Ringing a bell. |
| Discrete, slow onset/repeated | Scraping (T) | Scraping, grating, rubbing an object. |
| Discrete, slow onset/repeated | Whipping (D) | The whoosh of a whip. |
| Continuous, stationary | Gushing (T) | Water gushing, flowing. |
| Continuous, stationary | Blowing (D) | Wind blowing; blowing air through a pipe. |
| Continuous, complex | Rolling (T) | An object rolling down a surface. |
| Continuous, complex | Filling (D) | Filling a small container. |
The selection of sounds of manufactured products, classified into morphological profiles.
T = target category; D = distractor category. The descriptions in the last column were provided to the experimental participants (see Section 3). *Sounds with a strong tonal component.
| Morphological profile | Category | Description |
|---|---|---|
| Discrete, impulsive | Buttons and switches (T) | A switch, a button, a computer key. |
| Discrete, impulsive | Doors closing (D) | Closing a door. |
| Discrete, slow onset/repeated | Saws and files (T) | A person sawing or sanding an object. |
| Discrete, slow onset/repeated | Windshield wipers* (D) | Windshield wipers wiping the windshield. |
| Continuous, stationary | Refrigerator* (T) | A refrigerator’s hum. |
| Continuous, stationary | Blenders* (D) | Food processors switched on, processing food, and switched off. |
| Continuous, complex | Printers* (T) | A printer or a fax printing pages. |
| Continuous, complex | Revs up* (D) | Cars and motorcycles revving up. |
Fig 1. Method to create auditory sketches.
Parameters used to synthesize the sketches at the three quality levels (Q1 = lowest quality, Q3 = highest quality).
| Parameter | Q1 | Q2 | Q3 |
|---|---|---|---|
| Coefficients per second | 160 | 800 | 4000 |
| Temporal resolution (LPC model) | 44 ms | 20 ms | 9 ms |
| Number of LPC coefficients (LPC model) | 7 | 16 | 36 |
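To make the LPC rows of the table concrete, the following is a minimal Python sketch of an LPC analysis/resynthesis vocoder driven by the Q1 and Q3 settings above. The non-overlapping framing, the noise excitation, the 16 kHz sample rate, and the file name are illustrative assumptions; this is not the authors' exact sketching pipeline.

```python
# Minimal LPC-vocoder "sketch", assuming noise excitation and
# non-overlapping frames; parameter values follow the table above.
import numpy as np
import librosa
import scipy.signal

def lpc_sketch(y, sr, frame_ms, order):
    """Resynthesize y frame by frame from an all-pole (LPC) model."""
    frame_len = int(sr * frame_ms / 1000)
    out = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, frame_len):
        frame = y[start:start + frame_len]
        a = librosa.lpc(frame, order=order)        # all-pole coefficients
        excitation = np.random.randn(frame_len)    # noise-driven source
        synth = scipy.signal.lfilter([1.0], a, excitation)
        gain = np.sqrt(np.mean(frame ** 2))        # match the frame's energy
        synth *= gain / (np.sqrt(np.mean(synth ** 2)) + 1e-12)
        out[start:start + frame_len] = synth
    return out

y, sr = librosa.load("referent.wav", sr=16000)       # hypothetical input file
sketch_q1 = lpc_sketch(y, sr, frame_ms=44, order=7)  # Q1 settings
sketch_q3 = lpc_sketch(y, sr, frame_ms=9, order=36)  # Q3 settings
```

Shorter frames and higher orders (Q3) preserve more temporal and spectral detail, which is why the table's quality levels trade off along both dimensions.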
Fig 2. Structure of the identification experiment.
Fig 3. Discrimination sensitivity indices (d′) and accuracy (assuming no bias) for the four morphological profiles in the family of product sounds.
The left panels show the data for the ten imitators (I32, I23, etc. are the codes of the individual imitators). The right panels zoom in on the best imitation for each morphological profile and compare it to the three auditory sketches and to the referent sounds. Gray shading represents the quality of the sketches (from light gray for Q1 to dark gray for Q3, and black for the referent sound). The right panels also report the results of four t-tests comparing the best imitator to each of the three auditory sketches and to the referent sounds. When the best imitation is not significantly different from an auditory sketch (alpha level of .05/4), it receives the same shading. Vertical bars represent the 95% confidence interval of the mean. *Significantly different from chance level after Bonferroni correction (p < .05/4).
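For readers who want to reproduce the two measures in the figure, here is a minimal Python sketch of the standard equal-variance signal detection computations: d′ from hit and false alarm rates, and the accuracy of an unbiased observer, Φ(d′/2). The log-linear correction and the example counts are illustrative assumptions, not the authors' analysis code.

```python
# d-prime under the standard equal-variance signal detection model.
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    # Log-linear correction avoids infinite z-scores when a rate is 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

def accuracy_no_bias(dprime):
    # Proportion correct of an unbiased observer: Phi(d'/2).
    return norm.cdf(dprime / 2.0)

dp = d_prime(hits=18, misses=2, false_alarms=4, correct_rejections=16)
print(dp, accuracy_no_bias(dp))   # example counts are purely illustrative
```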
Fig 4. Discrimination sensitivity indices (d′) and accuracy (assuming no bias) for the four morphological profiles in the family of basic mechanical interactions.
See Fig 3 for details.
Fig 5. Indices of discrimination sensitivity (d′) as a function of the auditory distance between each sound and its corresponding referent sound.
Auditory distances are calculated as the cost of aligning the auditory spectrograms of the two sounds [76]. Circles represent the referent sounds (whose distance to themselves is therefore zero) and the three sketches. Black stars represent the ten imitators. The dashed line is the regression line between the auditory distances and the d′ values.
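Below is a hedged sketch of one way to compute such an alignment cost: dynamic time warping over spectrograms, via librosa. Ref [76] uses auditory spectrograms from an auditory model; the mel front end, the normalization by path length, and the file names here are stand-in assumptions rather than the paper's exact method.

```python
# Alignment-cost distance between two sounds: DTW over mel spectrograms
# (a stand-in for the auditory-model spectrograms of ref [76]).
import numpy as np
import librosa

def auditory_distance(path_a, path_b, sr=16000):
    ya, _ = librosa.load(path_a, sr=sr)
    yb, _ = librosa.load(path_b, sr=sr)
    Sa = librosa.power_to_db(librosa.feature.melspectrogram(y=ya, sr=sr))
    Sb = librosa.power_to_db(librosa.feature.melspectrogram(y=yb, sr=sr))
    D, wp = librosa.sequence.dtw(X=Sa, Y=Sb, metric="euclidean")
    return D[-1, -1] / len(wp)   # total alignment cost per path step

dist = auditory_distance("imitation.wav", "referent.wav")  # hypothetical files
```

A referent sound compared with itself yields a zero-cost alignment, which is why the referent circles in Fig 5 sit at a distance of zero.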
Fig 6. Indices of discrimination sensitivity (d′) as a function of the feature distance between each sound and its corresponding referent sound.
Feature distances are calculated as the Euclidean norm of the difference between the feature vectors defined in [80].
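As a worked example, this distance reduces to the Euclidean norm of the difference between two feature vectors. The sketch below assumes the per-sound feature vectors prescribed by [80] have already been computed and standardized; the numeric values are purely illustrative.

```python
# Feature distance: Euclidean norm of the difference between two feature
# vectors (assumed precomputed per ref [80] and standardized).
import numpy as np

def feature_distance(feats_a, feats_b):
    return float(np.linalg.norm(np.asarray(feats_a) - np.asarray(feats_b)))

d = feature_distance([0.2, 1.3, -0.5], [0.1, 0.9, 0.4])  # illustrative values
```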
Phonological description of the best and worst vocal imitations.
The last column suggests the cues that are important for successful identification. M = morphological profile; I = impulsive; R = repeated; S = stationary; C = complex; SO = slow onset.
| M | Best imitations | Worst imitations | Cue to identify |
|---|---|---|---|
| Product sounds | | | |
| I | Two rapid clicks | One single click or two slow clicks | Two rapid clicks |
| R | Rhythm of egressive & ingressive streams and fricatives | Irregular sequence of trills and fricatives | Regular repetition of egressive & ingressive streams + fricatives |
| S | Continuous voiced part with (modulated) fricatives; initial occlusion | Continuous egressive stream with initial occlusion + fricatives | Voiced part, low |
| C | Sequence of voiced and fricative parts + egressive & ingressive streams | Unstructured sequence of egressive streams + voiced and fricative parts | Structured voiced + fricative parts |
| Basic mechanical interactions | | | |
| I | Short occlusion + decreasing egressive stream and fricatives | Occlusion + egressive stream with trill or fricative parts | Short occlusion + decreasing turbulent stream + fricatives |
| SO | Alternate or modulated fricative parts (+ trills) | Egressive stream with some fricatives, irregular rhythm | Fricatives + regular rhythm |
| S | Egressive stream with fricatives with a fine regular texture | Fine texture with timbre variations | Fricatives + fine regular texture |
| C | Continuous breathy voiced part with trills or fricatives, or sequence of trills & fricatives, with increasing pitch or spectral centroid | Sustained voiced note; sequence of clicks | Fricatives + trills + spectral increase |