| Literature DB >> 23818883 |
Claudia Canevari, Leonardo Badino, Alessandro D'Ausilio, Luciano Fadiga, Giorgio Metta.
Abstract
Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The limitations of current ASR become mainly evident in the realistic use of such systems. These limitations can be partly reduced by using normalization strategies that minimize inter-speaker variability by either explicitly removing speakers' peculiarities or adapting different speakers to a reference model. In this paper we aim at modeling a motor-based imitation learning mechanism in ASR. We tested the utility of a speaker normalization strategy that uses motor representations of speech and compared it with strategies that ignore the motor domain. Specifically, we first trained a regressor through state-of-the-art machine learning techniques to build an auditory-motor mapping, in a sense mimicking a human learner that tries to reproduce utterances produced by other speakers. This auditory-motor mapping maps the speech acoustics of a speaker into the motor plans of a reference speaker. Since, during recognition, only speech acoustics are available, the mapping is necessary to "recover" motor information. Subsequently, in a phone classification task, we tested the system on either one of the speakers that was used during training or a new one. Results show that in both cases the motor-based speaker normalization strategy slightly but significantly outperforms all other strategies where only acoustics is taken into account.
Keywords: acoustic-to-articulatory mapping; automatic speech classification; deep neural networks; mirror neurons; phone classification; speaker normalization; speech imitation
Year: 2013 PMID: 23818883 PMCID: PMC3694210 DOI: 10.3389/fpsyg.2013.00364
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
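The central mechanism described in the abstract — mapping a speaker's acoustics into a reference speaker's motor domain and using the "recovered" motor information alongside the acoustics — can be summarized in a short sketch. The function name and the regressor interface below are hypothetical placeholders, not the authors' code; only the feature sizes (60 MFSCs, 42 articulatory features) follow the paper.

```python
import numpy as np

def motor_normalize(mfsc_frames, audio_to_motor):
    """Map a speaker's acoustics into the listener (reference) motor domain
    and return the joint acoustic + articulatory representation.

    mfsc_frames    : (n_frames, 60) array of Mel-filtered spectral coefficients
    audio_to_motor : trained regressor (hypothetical interface) mapping
                     (n_frames, 60) acoustics -> (n_frames, 42) listener
                     articulatory features
    """
    recovered_afs = audio_to_motor.predict(mfsc_frames)  # "recovered" motor information
    return np.hstack([mfsc_frames, recovered_afs])       # (n_frames, 60 + 42)
```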
Figure 1. Computational approach and model design. (A) Architecture of the phone classifier. Ten vectors of 60 Mel-filtered Spectral Coefficients (MFSCs) and, when articulatory features are used (dashed lines), 10 vectors of 42 articulatory features (reconstructed from the MFSCs through the acoustic-to-articulatory mapping shown in panel B) are fed into a Deep Neural Network. The Deep Neural Network computes the posterior probabilities of each of the 43 phones (output layer) given the acoustic and articulatory evidence, and then the most probable phone is selected. (C) Implementation details of the deep neural network that carries out the acoustic-to-articulatory mapping for motor-based normalization. (D) Implementation details of the deep neural network that carries out the acoustic-to-acoustic mapping for acoustic normalization.
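A minimal sketch of the panel (A) classifier follows, assuming a plain feed-forward network. The hidden-layer sizes and activations are illustrative and not taken from the paper; only the input and output dimensions (10 frames of 60 MFSCs plus 10 frames of 42 articulatory features, 43 phone classes) follow the caption.

```python
import torch
import torch.nn as nn

N_FRAMES, N_MFSC, N_AF, N_PHONES = 10, 60, 42, 43

# Illustrative feed-forward topology; layer widths are assumptions, not the paper's.
phone_classifier = nn.Sequential(
    nn.Linear(N_FRAMES * (N_MFSC + N_AF), 1024),  # 10 frames x (60 MFSCs + 42 AFs)
    nn.Sigmoid(),
    nn.Linear(1024, 1024),
    nn.Sigmoid(),
    nn.Linear(1024, N_PHONES),                    # scores for the 43 phones
)

def predict_phone(mfsc_window, af_window):
    """mfsc_window: (10, 60) tensor; af_window: (10, 42) tensor."""
    x = torch.cat([mfsc_window, af_window], dim=1).reshape(1, -1)   # (1, 1020)
    posteriors = torch.softmax(phone_classifier(x), dim=1)          # (1, 43)
    return posteriors.argmax(dim=1)   # index of the most probable phone
```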
Normalization strategies and training and testing settings.
The combination of five normalization strategies and two main training and testing scenarios results in 10 different phone classifiers. Acoustic features (AuFs) in bold-line rectangles are the actual 60 MFSCs. Acoustic features in shadowed dashed-line rectangles are the reconstructed audio features in the listener domain; they can be all 60 MFSCs, as in AcouNorm_A and AcouNorm_B, or the first K MFSCs, as in AcouNorm_C. Articulatory features (AFs) in shadowed dashed-line rectangles are the 42 reconstructed listener articulatory features. In the NoNorm scenario no normalization is applied, while in AcouNorm_A the listener acoustic features reconstructed from speaker S1 are used together with the actual ones only during training.
Feature sets for training and testing of the acoustic-to-articulatory and the acoustic-to-acoustic mappings.
| AcouNorm_A |
| AcouNorm_B |
| AcouNorm_C |
| MotorNorm |
This table shows the input-output feature pairs used for training and testing the DNNs that perform the acoustic-to-articulatory and acoustic-to-acoustic mappings.
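As a sketch of how such input-output pairs might be used, the snippet below fits a small multilayer-perceptron regressor as a stand-in for the paper's deep networks. The placeholder data, hidden-layer sizes, and the value of K are illustrative assumptions only; in the paper the inputs and targets are time-aligned frames from parallel utterances.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames = 200                                    # placeholder frame count
speaker_mfsc = rng.normal(size=(n_frames, 60))    # speaker acoustics (60 MFSCs/frame)
listener_afs = rng.normal(size=(n_frames, 42))    # listener articulatory features
listener_mfsc = rng.normal(size=(n_frames, 60))   # listener acoustics

# Acoustic-to-articulatory mapping (MotorNorm): speaker MFSCs -> listener AFs
audio_to_motor = MLPRegressor(hidden_layer_sizes=(300, 300), max_iter=200)
audio_to_motor.fit(speaker_mfsc, listener_afs)

# Acoustic-to-acoustic mapping (AcouNorm): speaker MFSCs -> listener MFSCs
# (all 60 coefficients for AcouNorm_A/B, or only the first K for AcouNorm_C)
K = 20  # illustrative value, not the paper's
audio_to_audio = MLPRegressor(hidden_layer_sizes=(300, 300), max_iter=200)
audio_to_audio.fit(speaker_mfsc, listener_mfsc[:, :K])
```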
Feature sets of all the phone classifiers.
| NoNorm_T1 |
| NoNorm_T2 |
| AcouNorm_A_T1 |
| AcouNorm_A_T2 |
| AcouNorm_B_T1 |
| AcouNorm_B_T2 |
| AcouNorm_C_T1 |
| AcouNorm_C_T2 |
| MotorNorm_T1 |
| MotorNorm_T2 |
L is the listener and is the only subject whose actual motor data (used for the acoustic-to-articulatory mapping) is available. S1 is the speaker whose data is always used during training and who is tested in the T1 training and testing setting. S2 can only be tested (T2 setting).
Overall phone classification error rate.
| 25.2 (25.3) | 20.9 (21) | 26.6 (27.1) |
| 29.9 (29.8) | 26.7 (27.2) | 34.3 (34.4) |
| 23.8 (24.1) | 19.8 (19.9) | 25.9 (26.6) |
| 24.6 (24.7) | 20.4 (20.5) | 26.5 (26.9) |
| 23.9 (23.6) | 19.7 (19.4) | 26.1 (26.2) |
Phone classification error rates averaged over all listener cases for all normalization strategies in all training and testing settings. In parentheses, the phone error rates obtained by removing listener 2.
Figure 2. Phone classification error rates for NoNorm, MotorNorm, and AcouNorm_C. Top panel: phone classification error rate in the T1_1Tr setting (all listener data plus 1/3 of the S1 speaker data used for training, and 2/3 for testing). The classification error rate of each listener is averaged over the four listener-speaker pairs. Middle panel: phone classification error rate in the T1_2Tr setting (all listener data plus 2/3 of the S1 speaker data used for training, and 1/3 for testing). The classification error rate of each listener is averaged over the four listener-speaker pairs. Bottom panel: phone classification error rate in the T2 scenario (all listener data plus 2/3 of the S1 speaker acoustic data used for training, and 1/3 of the S2 speaker data used for testing). The classification error rate of each listener is averaged over all 12 <listener L, speaker S1, speaker S2> triplets.
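The three settings in this caption can be summarized as a data-splitting helper. The function name, list-based data representation, and shuffling below are illustrative assumptions; only the 1/3 vs. 2/3 proportions and the roles of the listener, S1, and S2 follow the caption.

```python
import random

def make_split(listener_data, s1_data, s2_data, setting, seed=0):
    """Sketch of the three training/testing settings in Figure 2.
    Each *_data argument is a list of per-utterance feature matrices."""
    s1 = list(s1_data)
    random.Random(seed).shuffle(s1)
    third = len(s1) // 3

    if setting == "T1_1Tr":   # train: listener + 1/3 of S1; test: remaining 2/3 of S1
        return list(listener_data) + s1[:third], s1[third:]
    if setting == "T1_2Tr":   # train: listener + 2/3 of S1; test: remaining 1/3 of S1
        return list(listener_data) + s1[:2 * third], s1[2 * third:]
    if setting == "T2":       # train: listener + 2/3 of S1; test: 1/3 of new speaker S2
        return list(listener_data) + s1[:2 * third], list(s2_data)[: len(s2_data) // 3]
    raise ValueError(f"unknown setting: {setting}")
```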
Figure 3. Four average correlations. The four Pearson product-moment correlation coefficients are computed for each listener and averaged over all speakers. Average correlation between articulatory features (AFs) of the same subject extracted from different instances of the same word type (circles). Average correlation between the listener actual AFs and the corresponding speaker actual AFs (squares). Average correlation between the listener actual AFs and corresponding AFs recovered through acoustic-to-articulatory mapping (triangles). Average correlation between the speaker actual AFs and the corresponding AFs recovered from the same speaker acoustics (crosses).
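One of the quantities in Figure 3 — the correlation between actual and recovered articulatory trajectories — could be computed along the lines below. Averaging the per-feature coefficients over the 42 articulatory features is an assumption of this sketch; the caption does not spell out the exact averaging.

```python
import numpy as np
from scipy.stats import pearsonr

def average_af_correlation(actual_afs, recovered_afs):
    """Mean Pearson correlation between two time-aligned articulatory
    trajectories of shape (n_frames, 42): one coefficient per articulatory
    feature, then the average over features."""
    coeffs = [pearsonr(actual_afs[:, j], recovered_afs[:, j])[0]
              for j in range(actual_afs.shape[1])]
    return float(np.mean(coeffs))
```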