Abstract
Modeling speech production and speech articulation is still an evolving research topic. Some current core questions are: What is the underlying (neural) organization for controlling speech articulation? How can speech articulators such as the lips and tongue, and their movements, be modeled in an efficient but also biologically realistic way? How can high-quality articulatory-acoustic models be developed that lead to high-quality articulatory speech synthesis? Thus, on the one hand, computer modeling will help us unfold underlying biological as well as acoustic-articulatory concepts of speech production, and on the other hand, further modeling efforts will help us reach the goal of high-quality articulatory-acoustic speech synthesis based on more detailed knowledge of vocal tract acoustics and speech articulation. Currently, articulatory models are not able to reach the quality level of corpus-based speech synthesis. Moreover, biomechanical and neuromuscular approaches are complex and still not usable for sentence-level speech synthesis. This paper lists many computer-implemented articulatory models and provides criteria for dividing articulatory models into different categories. A major recent research question, namely how to control articulatory models in a neurobiologically adequate manner, is discussed in detail. It can be concluded that there is a strong need to further develop articulatory-acoustic models in order to test quantitative neurobiologically based control concepts for speech articulation as well as to uncover the remaining details of human articulatory and acoustic signal generation. Furthermore, these efforts may help us approach the goal of establishing high-quality articulatory-acoustic as well as neurobiologically grounded speech synthesis.
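Several of the models surveyed in the paper belong to the "statistical" category: they derive a small set of control parameters (degrees of freedom) from measured articulator contours via linear component analysis. The sketch below illustrates that idea on synthetic data only: "contours" are generated from two hidden control parameters plus noise, and a plain power-iteration PCA recovers the fact that two components explain nearly all variance. The contour shapes, factor variances, and sample size are invented for illustration and do not come from the paper.

```python
import math
import random

def pca_power(data, n_components=2, iters=200):
    """Top principal components of row-vector data via power iteration
    with deflation (pure Python, for illustration only)."""
    m, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / m for j in range(d)]
    X = [[row[j] - mean[j] for j in range(d)] for row in data]
    # covariance matrix C = X^T X / m
    C = [[sum(X[i][a] * X[i][b] for i in range(m)) / m for b in range(d)]
         for a in range(d)]
    total_var = sum(C[j][j] for j in range(d))
    comps, evals = [], []
    for _ in range(n_components):
        v = [random.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d))
                  for a in range(d))
        comps.append(v)
        evals.append(lam)
        # deflate: C <- C - lam * v v^T, then repeat for the next component
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return comps, evals, total_var

# Synthetic "tongue contours": 12 flesh points driven by two hypothetical
# control parameters (e.g., tongue height, front-back) plus sensor noise.
random.seed(1)
d = 12
basis1 = [math.sin(math.pi * j / (d - 1)) for j in range(d)]
basis2 = [math.cos(math.pi * j / (d - 1)) for j in range(d)]
data = []
for _ in range(200):
    a, b = random.gauss(0, 3.0), random.gauss(0, 1.0)
    data.append([a * basis1[j] + b * basis2[j] + random.gauss(0, 0.05)
                 for j in range(d)])

comps, evals, total = pca_power(data, 2)
explained = (evals[0] + evals[1]) / total  # close to 1: two DOF suffice
```

The point mirrors the reported goal of several statistical models: the number of components needed to explain the variance of the measured contours is an estimate of how many control parameters the articulatory model needs.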
Keywords: articulatory model; biomechanical model; speech acoustics; speech production; vocal tract
Year: 2022 PMID: 35494539 PMCID: PMC9040071 DOI: 10.3389/frobt.2022.796739
Source DB: PubMed Journal: Front Robot AI ISSN: 2296-9144
FIGURE 1 Midsagittal view generated using the two-dimensional articulatory model of Kröger et al. (2014).
TABLE 1 List of computer-implemented articulatory models (rows) and criteria for differentiating articulatory models with respect to several features (columns; see text).
| Dim. | Biomechanical vs. geometrical vs. statistical | Number of control parameters | Data | Acoustic model | Complete VT | Dynamic vs. static; all sounds or a subset; syllables | Major goal |
|---|---|---|---|---|---|---|---|
| 3D | statistical, linear component analysis | low (<10) | static MRI plus video facial data | no | yes | dynamic | identifying phonetically and biomechanically interpretable model control parameters |
| 2D | statistical, linear component analysis | low (9) | cine X-ray plus video facial data | yes | yes | dynamic | identifying the degrees of freedom (i.e., the number of control parameters) for an articulatory model (statistical model) |
| 3D | geometric; parametric | middle (15) | static MRI data | yes | yes | dynamic | high-quality speech synthesis |
| 3D | muscle force model; biomechanical tissue model | middle (11) | static MRI and CT data for vowels (geometries); EMG data (muscle activation) | yes | yes | static V-sounds | muscle force levels for different French vowels |
| 2D | geometric; parametric | low (<10) | static X-ray data | yes | yes | dynamic | speech synthesis by rule; developing an approach for articulatory commands |
| 2D | muscle force model; biomechanical tissue model | middle (10 for tongue) | static MRI data | no | tongue + VT walls | movements towards V- and C-sound equilibrium positions | identifying agonist-antagonist muscle groups (muscle synergies) for V- and C-sounds |
| 3D | statistical; ordered linear factor analysis | low (6 for tongue) | static MRI plus EMA, EPG | no | tongue | static V- and C-sounds; VC-sequences with C = fricative | identifying kinematic model control parameters; developing methods for including EMA and EPG data for modeling tongue movements |
| 3D | muscle activation + force model; biomechanical tissue model | high (21 for tongue and jaw) | tagged MRI plus cine MRI data | formants | tongue + VT walls | tongue forward-backward movement | specifying speaker-specific muscle activation patterns based on tagged and cine MRI data |
| 2D | geometric; parametric | non-parametric "goal-seeking" approach | cine X-ray data | transfer function | yes | dynamic | specifying control concepts for articulatory movements and modeling coarticulation |
| 2D | geometric; parametric | low (<10) | static X-ray data, ultrasound, static MRI | yes | yes | dynamic | research tool; testing gesture patterns by perception |
| 2D | geometric; parametric | low (<10) | static MRI data | yes | yes | dynamic | midsagittal views of dynamic articulation for teaching and as a tool in speech therapy |
| 2D | statistical; principal component analysis; growth model | low (7), see also | cine X-ray data | yes | yes | dynamic | research tool; identification of model control parameters |
| 2D | geometric; parametric | low (<10) | static X-ray data | yes | yes | VCV-sequences | VCV-sequences; coarticulation; speech synthesis; research tool |
| 2D | muscle activation + force model (lambda model); biomechanical tissue model | low (<10 for tongue) | qualitative comparison with CVC movement data extracted from the literature | no | tongue | VCV-sequences | VCV-sequences; C = velar consonant; tongue body movement during C (loops) |
| 2D | muscle activation + force model (lambda model); biomechanical tissue model | middle and low (17 muscles -> 6 factors explaining 75% of the variance; tongue + jaw + hyoid) | cine X-ray data | no | tongue, hyoid, larynx, lower jaw | periodic jaw and tongue movements | organization of control signals; dynamic behavior of articulators; identifying muscle synergies |
| 2D | statistical; principal component analysis; speaker-specific | middle (14) | static MRI data | no | yes | static V- and C-sounds | estimation of control parameters; based on 11 different speakers; generating individual models and a mean speaker model |
| 3D | statistical; generic surface triangular mesh; principal component analysis | low (2 for velum) | static MRI and CT plus EMA data | yes | velum + naso-pharyngeal wall | static V- and C-sounds | identifying geometric model control parameters; modeling velum movements using additional EMA data; resynthesis of nasals |
| 2D | geometric | low (<10) | cine X-ray data | yes | yes | dynamic | speaker-specific vocal tract geometries for short sound sequences |
| 1D | parametric area-function | middle (16) | static MRI data | yes | yes | dynamic | high-quality and real-time speech synthesis |
| 1D | parametric area-function model; speaker-dependent; growth | low (<10) for static vowel model; middle (14) for dynamic model | static CT and MRI data | yes | area functions including nasal tract | dynamic (VV, VCV and VCCV utterances) or static V- and C-sounds | articulatory-acoustic relations for males/females and for newborns/children/adults; high-quality speech synthesis of isolated sounds and of sound sequences |
| 3D | muscle activation + force model; biomechanical tissue model | low (<10 at higher control level); middle (<20 at lower control level; tongue) | static MRI data | no | tongue | static tongue configurations | physiologically based computer simulation of speech production; research tool |
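Several 1D entries in the table above drive a parametric area function through a tube or transfer-function acoustic model. As a rough illustration of how such an articulatory-acoustic mapping can work, the sketch below estimates the resonances (formants) of a piecewise-constant area function with a Kelly-Lochbaum scattering ladder. This is a generic textbook technique, not the implementation of any model listed above; the end-reflection coefficients, section count, and sampling rate are illustrative assumptions.

```python
import cmath

def kelly_lochbaum_formants(areas, fs=40000, n_samples=1024, n_peaks=3,
                            r_glottis=0.9, r_lips=-0.9, f_max=4000):
    """Estimate vocal-tract resonances (formants, Hz) for a piecewise-
    constant area function (cm^2, ordered glottis -> lips) using a
    Kelly-Lochbaum scattering ladder with a one-sample delay per section.

    With c = 35000 cm/s, n sections at fs = 40000 Hz model a tube of
    length n * c / fs (20 sections ~ 17.5 cm, an adult-sized vocal tract).
    End reflections below 1 in magnitude provide damping.
    """
    n = len(areas)
    # pressure-wave reflection coefficients at the internal junctions
    refl = [(areas[k] - areas[k + 1]) / (areas[k] + areas[k + 1])
            for k in range(n - 1)]
    fwd = [0.0] * n          # right-going wave in each section
    bwd = [0.0] * n          # left-going wave in each section
    out = []
    for t in range(n_samples):
        nf, nb = [0.0] * n, [0.0] * n
        # impulse excitation plus partial reflection at the glottis
        nf[0] = (1.0 if t == 0 else 0.0) + r_glottis * bwd[0]
        for k in range(n - 1):
            r = refl[k]
            nf[k + 1] = (1 + r) * fwd[k] - r * bwd[k + 1]  # transmitted right
            nb[k] = r * fwd[k] + (1 - r) * bwd[k + 1]      # reflected left
        out.append((1 + r_lips) * fwd[n - 1])              # radiated at lips
        nb[n - 1] = r_lips * fwd[n - 1]
        fwd, bwd = nf, nb
    # magnitude spectrum of the impulse response up to f_max (plain DFT)
    k_max = int(f_max * n_samples / fs)
    mags = [abs(sum(out[t] * cmath.exp(-2j * cmath.pi * k * t / n_samples)
                    for t in range(n_samples))) for k in range(k_max)]
    # resonances = prominent local maxima, in ascending frequency order
    top = max(mags)
    formants = [k * fs / n_samples for k in range(1, k_max - 1)
                if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]
                and mags[k] > 0.2 * top]
    return formants[:n_peaks]

# A uniform tube yields resonances near the textbook quarter-wavelength
# values of roughly 500, 1500, and 2500 Hz; perturbing the areas shifts
# them, which is the core of area-function-based articulatory synthesis.
uniform = kelly_lochbaum_formants([1.0] * 20)
```

Articulatory control in such a model then amounts to changing the area values over time, i.e., the "low" to "middle" numbers of control parameters listed for the 1D models above.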
FIGURE 2 Visualization of the evolution of articulatory models over time. Models are cited here by the first author, as listed in Table 1. Main research goals and two criteria for differentiating articulatory models (biomechanical vs. geometrical and statistical models; 1D and 2D models vs. 3D models) are labeled and visualized.
FIGURE 3 Hierarchical organization of articulatory models, their control modules, and their levels of control (blue boxes).
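As a generic illustration of the kind of two-level control hierarchy Figure 3 refers to (not the specific architecture of the paper), the sketch below maps a high-level "gestural score" of targets and durations onto a low-level articulatory parameter trajectory using a critically damped point attractor, in the spirit of task-dynamics approaches. All names and constants here are illustrative assumptions.

```python
def gesture_trajectory(x0, target, duration, dt=0.001, omega=25.0):
    """Low control level: one gesture drives an articulatory parameter
    toward its target via a critically damped second-order system,
    x'' = -omega^2 (x - target) - 2 omega x' (no overshoot)."""
    x, v, traj = x0, 0.0, []
    steps = round(duration / dt)
    for _ in range(steps):
        a = -omega * omega * (x - target) - 2.0 * omega * v
        v += a * dt           # semi-implicit Euler integration
        x += v * dt
        traj.append(x)
    return traj

def control_score(score, x0=0.0):
    """High control level: a gestural score, i.e., a sequence of
    (target, duration) pairs, concatenated into one trajectory for a
    single hypothetical articulatory parameter (e.g., lip aperture)."""
    traj, x = [], x0
    for target, dur in score:
        seg = gesture_trajectory(x, target, dur)
        traj.extend(seg)
        x = seg[-1]
    return traj

# Example: close toward 1.0 for 300 ms, then open toward 0.2 for 300 ms.
trajectory = control_score([(1.0, 0.3), (0.2, 0.3)])
```

The separation mirrors the hierarchical picture: the score specifies *what* to reach and *when*, while the dynamical system specifies *how* the articulator moves there.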