Jody Kreiman, Bruce R Gerratt, Marc Garellek, Robin Samlan, Zhaoyan Zhang.
Abstract
At present, two important questions about voice remain unanswered: When voice quality changes, what physiological alteration caused this change, and if a change to the voice production system occurs, what change in perceived quality can be expected? We argue that these questions can only be answered by an integrated model of voice linking production and perception, and we describe steps towards the development of such a model. Preliminary evidence in support of this approach is also presented. We conclude that development of such a model should be a priority for scientists interested in voice, to explain what physical condition(s) might underlie a given voice quality, or what voice quality might result from a specific physical configuration.
Keywords: acoustics; modeling; synthesis; voice production; voice quality
Year: 2014 PMID: 27135054 PMCID: PMC4847936 DOI: 10.3989/loquens.2014.009
Source DB: PubMed Journal: Loquens ISSN: 2386-2637
Figure 1. The speech chain, describing the transmission of information from a speaker to a listener. The speaker’s brain generates an intent to phonate and a set of commands to the relevant muscles; sound is generated when the articulators modulate airflow through the glottis and vocal tract; this sound is transduced by the listener’s ear and transformed into neural messages, which are perceived and interpreted by the listener’s brain. Adapted from Denes and Pinson (1993).
Figure 2. The four-parameter source spectral model, fitted to the spectrum of a natural voice. The voice source was estimated via inverse filtering, and its spectrum was then calculated via fast Fourier transform. The amplitudes of individual harmonics are adjusted so that they conform to the slope of the appropriate model segment.
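As a rough illustration of how the four spectral-slope parameters could be read off a source spectrum, the sketch below estimates harmonic amplitudes from an FFT and then takes level differences at the segment boundaries. It is a minimal sketch under stated assumptions, not the authors' fitting procedure: the function names, the Hann window, the ±10% peak-search band, and the use of the harmonic nearest 2 kHz and 5 kHz are all illustrative choices, and the input is assumed to be an already inverse-filtered source signal with known F0.

```python
import numpy as np

def harmonic_amplitudes_db(source, fs, f0, n_harmonics):
    """Estimate harmonic amplitudes (dB) from the FFT magnitude spectrum.

    Assumes the signal is long enough that each search band contains
    at least one FFT bin.
    """
    windowed = source * np.hanning(len(source))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(source), d=1.0 / fs)
    amps = []
    for k in range(1, n_harmonics + 1):
        # Take the spectral peak within +/- 10% of f0 around the k-th harmonic.
        band = (freqs >= (k - 0.1) * f0) & (freqs <= (k + 0.1) * f0)
        amps.append(20.0 * np.log10(spectrum[band].max() + 1e-12))
    return np.array(amps)

def four_parameter_slopes(amps_db, f0):
    """Level differences (dB) across the four model segments."""
    def level_at(freq_hz):
        # Assumption: use the harmonic nearest to the boundary frequency.
        k = max(1, int(round(freq_hz / f0)))
        if k > len(amps_db):
            raise ValueError("not enough harmonics to reach %.0f Hz" % freq_hz)
        return amps_db[k - 1]

    h1, h2, h4 = amps_db[0], amps_db[1], amps_db[3]
    a2k, a5k = level_at(2000.0), level_at(5000.0)
    return {
        "H1-H2": h1 - h2,
        "H2-H4": h2 - h4,
        "H4-2kHz": h4 - a2k,
        "2kHz-5kHz": a2k - a5k,
    }
```

Adjusting the harmonics inside each segment so that they lie on the straight line between these boundary levels would then give a piecewise-linear spectrum of the kind the caption describes.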
Components of the psychoacoustic model of voice quality and associated voice synthesis parameters.
| Model Component | Parameters |
|---|---|
| Harmonic source spectral shape | H1–H2 |
| | H2–H4 |
| | H4–2 kHz |
| | 2 kHz–5 kHz |
| Inharmonic source excitation | Spectrally-shaped noise-to-harmonics ratio |
| Time-varying source characteristics | |
| | Amplitude mean and standard deviation (or amplitude track) |
| Vocal tract transfer function | Formant frequencies/bandwidths |
| | Spectral zeroes/bandwidths |
The ratio of listener sensitivity (JND) to parameter variability across speakers, for the four source model parameters. Data from Kreiman et al. (in preparation).
| Parameter | Female speakers | Male speakers |
|---|---|---|
| H1–H2 | 0.17 | 0.24 |
| H2–H4 | 0.09 | 0.13 |
| H4–2 kHz | 0.09 | 0.09 |
| 2 kHz–5 kHz | 0.26 | 0.29 |
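The ratio in this table can be illustrated with a small calculation. The source does not say how across-speaker variability was quantified, so the sketch below assumes the sample standard deviation; the function name and the input values are hypothetical and are not data from the study.

```python
import statistics

def jnd_to_variability_ratio(jnd_db, speaker_values_db):
    """Ratio of a just-noticeable difference (dB) to across-speaker spread.

    Assumption: spread is the sample standard deviation of the parameter
    values measured across speakers.
    """
    sd = statistics.stdev(speaker_values_db)
    if sd == 0:
        raise ValueError("no variability across speakers")
    return jnd_db / sd

# Hypothetical values for illustration only.
example_values = [2.0, 4.0, 6.0, 8.0]
ratio = jnd_to_variability_ratio(1.0, example_values)
```

Under this assumption, a ratio well below 1 means that typical differences between speakers on that parameter span several JNDs.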
Figure 3. The two-layer cover-body vocal fold model used in Zhang et al. (2013).
Figure 4. The user interface from the sort-and-rate perceptual task. Listeners click on an icon to play a voice sample, then drag the icons so that those that sound similar are placed close together on the line, and those that sound different are farther apart.
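One way responses from a task like this might be turned into perceptual dissimilarities is to take the distance between each pair of icons along the line. The sketch below assumes that each listener's response is stored as a mapping from sample name to a position on the line; the function name, the data layout, and the example values are hypothetical, and the source does not describe how the responses were actually analyzed.

```python
from itertools import combinations

def pairwise_distances(positions):
    """Absolute distance along the line for every pair of voice samples.

    positions: dict mapping a sample name to its position on the line
    (hypothetical layout, e.g. 0.0 = left end, 1.0 = right end).
    """
    return {
        (a, b): abs(positions[a] - positions[b])
        for a, b in combinations(sorted(positions), 2)
    }

# Hypothetical placement by one listener.
placement = {"voice_a": 0.1, "voice_b": 0.4, "voice_c": 0.9}
distances = pairwise_distances(placement)
```

Larger distances correspond to pairs the listener judged to sound more different.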
| Video file: | asymm_male.mp4 |
| Audio file: | asymm_male.mp3 |
| Example 1: | female1_natural.mp3 |
| | female1_synthetic.mp3 |
| Example 2: | female2_natural.mp3 |
| | female2_synthetic.mp3 |
| Example 3: | male1_natural.mp3 |
| | male1_synthetic.mp3 |
| Example 4: | male2_natural.mp3 |
| | male2_synthetic.mp3 |