| Literature DB >> 33921549 |
Abdelrahman Ahmed, Khaled Shaalan, Sergio Toral, Yasser Hifny.
Abstract
The paper proposes three modeling techniques to improve the performance evaluation of call center agents. The first technique is speech processing supported by an attention layer for the agent's recorded calls; the speech is represented by 65 features, extracted with the OpenSMILE toolkit, for the ultimate determination of the context of the call. The second technique uses the Max Weights Similarity (MWS) approach instead of the Softmax function in the attention layer to improve classification accuracy. The MWS function replaces Softmax for fine-tuning the output of the attention layer when processing text; it is formed by determining the similarity, in terms of distance, between the input weights of the attention layer and the weights of the max vectors. The third technique combines the agent's recorded call speech with the corresponding transcribed text for binary classification. The speech and text models are based on combinations of Convolutional Neural Networks (CNNs) and Bi-directional Long Short-Term Memory networks (BiLSTMs). The classification results for each model (text versus speech) are presented and compared with the results of the multimodal approach. The multimodal classification provided an improvement of 0.22% compared with the acoustic model and 1.7% compared with the text model.
Entities:
Keywords: BiLSTM; CNNs; attention layer; multimodal classification; performance modeling
Mesh:
Year: 2021 PMID: 33921549 PMCID: PMC8069216 DOI: 10.3390/s21082720
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1: The proposed framework illustrates two schemes, one for speech and one for text. The dotted lines indicate the multimodal approach, which merges one path from each scheme and forwards it to the output layer.
Figure 2: The study's neural network structure units.
The 65 provided Low-Level Descriptors (LLDs).

| Group | Descriptors |
|---|---|
| 54 spectral LLDs | RASTA-style auditory spectrum; MFCC 1–14; spectral energy; spectral roll-off point; entropy, spectral flux, skewness, variance, kurtosis; slope, harmonicity, psychoacoustic sharpness |
| 7 voicing-related LLDs | Probability of voicing; F0 by SHS with Viterbi smoothing; jitter, logarithmic HNR, shimmer; PCM FFT-magnitude spectral centroid (SMA, numeric) |
| 4 energy-related LLDs | Sum of auditory spectrum; sum of RASTA-style filtered auditory spectrum; RMS energy; zero-crossing rate |
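Two of the listed energy-related descriptors can be computed directly from framed audio. A minimal NumPy sketch (the 25 ms / 10 ms framing at 16 kHz is an illustrative assumption, not the paper's stated configuration):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    """RMS energy per frame (one of the 4 energy-related LLDs)."""
    return np.sqrt((frames ** 2).mean(axis=1))

def zero_crossing_rate(frames):
    """Sign changes per sample within each frame (another energy-related LLD)."""
    return 0.5 * np.abs(np.diff(np.sign(frames), axis=1)).mean(axis=1)

# 1 s of a 100 Hz sine sampled at 16 kHz: RMS ~ 0.707, ZCR ~ 200/16000
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t)
frames = frame_signal(x)
e = rms_energy(frames)
z = zero_crossing_rate(frames)
```

In practice the full 65-LLD set would be produced by the OpenSMILE toolkit rather than hand-written code; the sketch only shows what two of the simplest descriptors measure.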
Accuracy (Speech Processing) comparison.
| Model | Features | Speech Accuracy % |
|---|---|---|
| CNNs | MFCC | 82.7% |
| CNNs-Attention | MFCC | 84.27% |
| CNNs-BiLSTMs | MFCC | 83.55% |
| CNNs-BiLSTMs-Attention | MFCC | 83.54% |
| CNNs | LLD | 90.1% |
| CNNs-Attention | LLD | 92.48% |
| CNNs-Attention + MWS | LLD | 92.88% |
| CNNs-BiLSTMs | LLD | 92.67% |
| CNNs-BiLSTMs-Attention | LLD | 92.68% |
| CNNs-BiLSTMs-Attention + MWS | LLD | 92.25% |
Accuracy (Text Processing) comparison.
| Model | Features | Accuracy % |
|---|---|---|
| Naive Bayes | Bag of words | 67.3% |
| Logistic Regression | Bag of words | 80.76% |
| Linear Support Vector Machine (LSVM) | Bag of words | 82.69% |
| CNNs | Word Embedding | 90.73% |
| CNNs-Attention | Word Embedding | 90.98% |
| CNNs-Attention+MWS | Word Embedding | 91.4% |
| CNNs-BiLSTMs | Word Embedding | 89.87% |
| CNNs-BiLSTMs-Attention | Word Embedding | 91.19% |
| CNNs-BiLSTMs-Attention+MWS | Word Embedding | 91.12% |
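The classical baselines in the table (Naive Bayes, logistic regression, LSVM on bag-of-words) can be illustrated with a minimal NumPy sketch of logistic regression on word counts; the toy corpus, labels, and learning rate below are assumptions for illustration only:

```python
import numpy as np

# Toy transcripts standing in for transcribed agent calls (1 = positive call)
docs = ["thank you for your patience", "issue resolved thank you",
        "call dropped no resolution", "angry customer no apology"]
labels = np.array([1.0, 1.0, 0.0, 0.0])

# Bag-of-words: build a vocabulary, then per-document word-count vectors
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        X[r, idx[w]] += 1

# Logistic regression fitted by plain gradient descent on the log-loss
w = np.zeros(len(vocab))
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid
    g = p - labels                        # gradient of the log-loss
    w -= 0.5 * (X.T @ g) / len(docs)
    b -= 0.5 * g.mean()
pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(float)
```

The neural rows of the table replace the count vectors with learned word embeddings fed through CNN/BiLSTM/attention stacks.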
Accuracy (Multimodal models) comparison.
| Speech Model | Text Model | Accuracy % |
|---|---|---|
| CNNs | CNNs | 90.44% |
| CNNs-Attention | CNN | 90.1% |
| CNN | CNNs-Attention | 92.63% |
| CNN | CNNs-Attention + MWS | 92.9% |
| CNN-Attention | CNNs-Attention | 91.76% |
| CNN-Attention + MWS | CNNs-Attention + MWS | 93.1% |
| CNNs | CNNs-BiLSTMs-Attention | 91.8% |
| CNNs | CNNs-BiLSTMs-Attention + MWS | 91.9% |
| CNNs-Attention | CNNs-BiLSTMs | 90.36% |
| CNNs-Attention + MWS | CNNs-BiLSTMs | 91.1% |
| CNNs-Attention | CNNs-BiLSTMs-Attention | 91% |
| CNNs-Attention + MWS | CNNs-BiLSTMs-Attention + MWS | 91.1% |
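Figure 1 describes the multimodal approach as merging one path from each scheme before the output layer. A minimal late-fusion sketch in NumPy (the batch size, feature widths, and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the penultimate activations of each branch
speech_feat = rng.normal(size=(8, 64))   # batch of 8 calls, 64-d speech path
text_feat = rng.normal(size=(8, 32))     # same 8 calls, 32-d text path

# Late fusion: concatenate the two paths, then a shared binary output layer
fused = np.concatenate([speech_feat, text_feat], axis=1)
W = rng.normal(size=(fused.shape[1], 1)) * 0.1
b = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(fused @ W + b)))   # sigmoid for binary output
```

Concatenation lets the shared output layer weigh evidence from both modalities, which is consistent with the small accuracy gain the multimodal rows show over either branch alone.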
Figure 3: Modeling approaches.
The table compares the MWS method with the Softmax function used in the attention layer (accuracy improvement, %).

| Function | Speech | Text | Multimodal |
|---|---|---|---|
| Softmax | 92.68% | 90.98% | 91.76% |
| MWS | 92.88% | 91.4% | 93.1% |
| Delta | 0.2% | 0.42% | 1.34% |
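The abstract describes MWS as weighting the attention layer's inputs by their distance-based similarity to the max vector. One plausible reading, sketched in NumPy alongside the standard Softmax it replaces; the exact MWS formula is given in the paper, so this inverse-distance form is an assumption:

```python
import numpy as np

def softmax(scores):
    """Standard attention normalization."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def max_weights_similarity(scores):
    """Hypothetical MWS reading: weight each score by its inverse-distance
    similarity to the maximum score, then normalize to sum to 1."""
    dist = scores.max() - scores     # distance of each weight to the max
    sim = 1.0 / (1.0 + dist)         # similarity: 1 at the max, smaller elsewhere
    return sim / sim.sum()

scores = np.array([2.0, 1.0, 0.5])
w_sm = softmax(scores)
w_mws = max_weights_similarity(scores)
```

Both functions produce a valid attention distribution that peaks at the max score; they differ in how sharply the remaining weights decay, which is where the small accuracy deltas in the table arise.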