Evaldas Vaiciukynas, Antanas Verikas, Adas Gelzinis, Marija Bacauskiene.
Abstract
This study investigates signals from sustained phonation and text-dependent speech modalities for Parkinson's disease screening. Phonation corresponds to the vowel /a/ voicing task and speech to the pronunciation of a short sentence in the Lithuanian language. Signals were recorded through two channels simultaneously, namely acoustic cardioid (AC) and smart phone (SP) microphones. Additional modalities were obtained by splitting the speech recording into voiced and unvoiced parts. Information in each modality is summarized by 18 well-known audio feature sets. Random forest (RF) is used as the machine learning algorithm, both for individual feature sets and for decision-level fusion. Detection performance is measured by the out-of-bag (OOB) equal error rate (EER) and the cost of log-likelihood-ratio (Cllr). The Essentia audio feature set performed best on the AC speech modality and the YAAFE audio feature set performed best on the SP unvoiced modality, achieving EERs of 20.30% and 25.57%, respectively. Fusion of all feature sets and modalities resulted in an EER of 19.27% for the AC channel and 23.00% for the SP channel. Non-linear projection of an RF-based proximity matrix into 2D space enriched medical decision support by visualization.
Year: 2017 PMID: 28982171 PMCID: PMC5628839 DOI: 10.1371/journal.pone.0185613
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of the database. Each cell gives the number of subjects, with the number of recordings in parentheses where it differs.
| Group | Phonation, AC | Phonation, SP | Speech, AC | Speech, SP | Fusion, AC | Fusion, SP |
|---|---|---|---|---|---|---|
| HC male | 11 (33) | 11 (33) | 11 | 11 | 11 (33) | 11 (33) |
| HC female | 24 (72) | 24 (72) | 24 | 24 | 24 (72) | 24 (72) |
| HC total | 35 (105) | 35 (105) | 35 | 35 | 35 (105) | 35 (105) |
| PD male | 30 (89) | 30 (90) | 29 | 30 | 29 (85) | 30 (90) |
| PD female | 34 (101) | 34 (102) | 34 | 34 | 34 (101) | 34 (102) |
| PD total | 64 (190) | 64 (192) | 63 | 64 | 63 (186) | 64 (192) |
| Total | 99 (295) | 99 (297) | 98 | 99 | 98 (291) | 99 (297) |
Notes. Subject: PD = Parkinson's disease patient, HC = healthy control subject. Microphone: AC = acoustic cardioid, SP = smart phone.
List of the individual feature sets.
| # | Feature set name | Size |
|---|---|---|
| 1 | avec2011 | 1941 |
| 2 | avec2013 | 2268 |
| 3 | emo_large | 6552 |
| 4 | emobase | 988 |
| 5 | emobase2010 | 1582 |
| 6 | IS09_emotion | 384 |
| 7 | IS10_paraling | 1582 |
| 8 | IS10_paraling_compat | 1582 |
| 9 | IS11_speaker_state | 4368 |
| 10 | IS12_speaker_trait | 5757 |
| 11 | IS12_speaker_trait_compat | 6125 |
| 12 | IS13_ComParE | 6373 |
| 13 | Essentia descriptors | 1915 |
| 14 | MPEG7 descriptors | 527 |
| 15 | KTU features | 1267 |
| 16 | jAudio features | 1794 |
| 17 | YAAFE features | 1885 |
| 18 | Tsanas features | 339 |
Overview of the emobase.conf file settings.
| Low-level descriptors | Statistical functionals |
|---|---|
| intensity, loudness, pitch, pitch envelope, 12 MFCCs, 8 frequencies of line spectral pairs, probability of voicing, zero-crossing rate | min (or max) value and its relative position in a signal, range, arithmetic mean, standard deviation, skewness, kurtosis, 3 quartiles, 3 inter-quartile ranges, 2 linear regression coefficients, linear and quadratic error |
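
The way a feature set of this kind turns a frame-level contour into a fixed-length vector can be sketched as follows. This is a minimal illustration, not the openSMILE implementation: the function name `functionals`, the dictionary keys, and the exact definitions (population standard deviation, non-excess kurtosis, mean absolute and mean squared regression residuals, positions scaled to [0, 1]) are assumptions made here for clarity.

```python
import statistics as st

def functionals(x):
    """Summarize one low-level descriptor contour (list of floats, len >= 2).

    Definitions are illustrative assumptions, not openSMILE's exact ones.
    """
    n = len(x)
    mean = st.fmean(x)
    std = st.pstdev(x)
    # Standardized values; a constant contour gets zero skewness/kurtosis.
    z = [(v - mean) / std for v in x] if std > 0 else [0.0] * n
    q1, q2, q3 = st.quantiles(x, n=4)
    # Least-squares line v = slope * t + offset over the frame index t.
    t_mean = (n - 1) / 2
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(x)) / s_tt
    offset = mean - slope * t_mean
    resid = [v - (slope * t + offset) for t, v in enumerate(x)]
    i_max = max(range(n), key=x.__getitem__)
    i_min = min(range(n), key=x.__getitem__)
    return {
        "max": x[i_max], "min": x[i_min], "range": x[i_max] - x[i_min],
        "max_pos": i_max / (n - 1), "min_pos": i_min / (n - 1),
        "mean": mean, "std": std,
        "skewness": sum(v ** 3 for v in z) / n,
        "kurtosis": sum(v ** 4 for v in z) / n,
        "q1": q1, "q2": q2, "q3": q3,
        "iqr12": q2 - q1, "iqr23": q3 - q2, "iqr13": q3 - q1,
        "slope": slope, "offset": offset,
        "lin_err": sum(abs(r) for r in resid) / n,
        "quad_err": sum(r * r for r in resid) / n,
    }

# Toy pitch-like contour rising by 2 units per frame.
print(functionals([1.0, 3.0, 5.0, 7.0, 9.0])["slope"])
```

Applying such a function to every low-level descriptor and concatenating the results yields one fixed-length feature vector per recording, regardless of the recording's duration.
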
OOB detection performance by Cllr using individual feature sets.
| # | Feature set name | AC: P | AC: S | AC: V | AC: U | SP: P | SP: S | SP: V | SP: U |
|---|---|---|---|---|---|---|---|---|---|
| 1 | avec2011 | 0.802 | 0.693 | 0.782 | 0.708 | 0.791 | 0.806 | 0.871 | 0.683 |
| 2 | avec2013 | 0.794 | 0.749 | 0.813 | 0.726 | 0.809 | 0.833 | 0.879 | 0.665 |
| 3 | emo_large | 0.744 | 0.758 | 0.768 | 0.724 | 0.834 | 0.775 | 0.856 | 0.760 |
| 4 | emobase | 0.718 | 0.737 | 0.758 | 0.666 | 0.857 | 0.776 | 0.833 | 0.637 |
| 5 | emobase2010 | 0.837 | 0.854 | 0.735 | 0.777 | 0.830 | 0.778 | 0.792 | 0.734 |
| 6 | IS09_emotion | 0.906 | 0.778 | 0.807 | 0.734 | 0.842 | 0.804 | 0.839 | 0.742 |
| 7 | IS10_paraling | 0.841 | 0.833 | 0.750 | 0.777 | 0.832 | 0.738 | 0.787 | 0.723 |
| 8 | IS10_paraling_compat | 0.838 | 0.879 | 0.706 | 0.777 | 0.826 | 0.764 | 0.792 | 0.729 |
| 9 | IS11_speaker_state | 0.822 | 0.722 | 0.767 | 0.636 | 0.838 | 0.777 | 0.758 | 0.737 |
| 10 | IS12_speaker_trait | 0.822 | 0.724 | 0.735 | 0.649 | 0.822 | 0.741 | 0.773 | 0.766 |
| 11 | IS12_speaker_trait_compat | 0.814 | 0.727 | 0.758 | 0.816 | 0.739 | 0.795 | 0.734 | |
| 12 | IS13_ComParE | 0.819 | 0.701 | 0.745 | 0.641 | 0.817 | 0.755 | 0.783 | 0.767 |
| 13 | Essentia_descriptors | 0.747 | 0.912 | 0.804 | 0.839 | 0.713 | |||
| 14 | MPEG7_descriptors | 0.665 | 0.623 | 0.798 | 0.753 | 0.910 | 0.844 | 0.911 | 0.745 |
| 15 | KTU_features | 0.810 | 0.770 | 0.780 | 0.767 | 0.930 | 0.805 | 0.837 | 0.707 |
| 16 | jAudio_features | 0.806 | 0.772 | 0.893 | 0.720 | 0.886 | 0.817 | 0.692 | |
| 17 | YAAFE_features | 0.717 | 0.761 | 0.770 | 0.713 | 0.892 | 0.701 | 0.812 | |
| 18 | Tsanas | 0.790 | 0.762 | 0.719 | 0.719 | 0.747 | 0.700 | ||
Notes. Microphone: AC = acoustic cardioid, SP = smart phone. Modality: P = phonation, S = speech, V = voiced part of speech, U = unvoiced part of speech.
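
For readers unfamiliar with Cllr, the standard definition can be sketched in a few lines. The sketch assumes that each detection score is a natural-log likelihood ratio (positive values favor PD); how the RF outputs were converted to such scores is not stated here, so that step is left out. The function name `cllr` is chosen for this example.

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood-ratio for natural-log LLR scores.

    target_llrs: scores of true PD recordings (assumed labeling).
    nontarget_llrs: scores of HC recordings.
    """
    def log2_1p_exp(a):
        # Numerically stable log2(1 + exp(a)).
        return (max(a, 0.0) + math.log1p(math.exp(-abs(a)))) / math.log(2)

    # Targets are penalized for low LLRs, non-targets for high LLRs.
    c_tar = sum(log2_1p_exp(-s) for s in target_llrs) / len(target_llrs)
    c_non = sum(log2_1p_exp(s) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non)

# An uninformative detector (LLR = 0 everywhere) scores about 1.0.
print(cllr([0.0, 0.0], [0.0]))
```

Values below 1 therefore indicate that the scores carry useful, reasonably calibrated information, and lower is better; confidently wrong scores can push Cllr above 1.
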
OOB detection performance by EER (in %) using individual feature sets.
| # | Feature set name | AC: P | AC: S | AC: V | AC: U | SP: P | SP: S | SP: V | SP: U |
|---|---|---|---|---|---|---|---|---|---|
| 1 | avec2011 | 30.65 | 27.40 | 31.38 | 28.46 | 30.74 | 34.08 | 36.67 | 26.96 |
| 2 | avec2013 | 32.09 | 28.76 | 31.61 | 29.59 | 32.64 | 34.90 | 38.41 | 27.32 |
| 3 | emo_large | 25.74 | 29.58 | 28.57 | 30.64 | 29.17 | 30.18 | 34.33 | 32.00 |
| 4 | emobase | 24.14 | 26.41 | 27.82 | 25.54 | 32.59 | 32.78 | 36.15 | 26.07 |
| 5 | emobase2010 | 31.84 | 35.76 | 32.22 | 35.71 | 30.89 | 30.17 | 31.34 | 31.34 |
| 6 | IS09_emotion | 37.03 | 28.06 | 33.86 | 32.56 | 34.26 | 32.17 | 33.82 | 28.33 |
| 7 | IS10_paraling | 32.59 | 34.01 | 31.71 | 35.20 | 31.89 | 30.53 | 32.62 | 31.54 |
| 8 | IS10_paraling_compat | 31.04 | 36.25 | 31.15 | 34.75 | 30.51 | 30.74 | 31.90 | 30.80 |
| 9 | IS11_speaker_state | 31.24 | 30.83 | 33.02 | 24.95 | 32.02 | 31.35 | | 28.28 |
| 10 | IS12_speaker_trait | 32.74 | 30.14 | 31.90 | 26.87 | 31.93 | 30.35 | 31.91 | 31.08 |
| 11 | IS12_speaker_trait_compat | 30.51 | 29.66 | 31.43 | 25.51 | 31.51 | 29.85 | 31.65 | 31.34 |
| 12 | IS13_ComParE | 32.14 | 30.08 | 33.16 | | 31.99 | 30.17 | 31.55 | 31.33 |
| 13 | Essentia_descriptors | | 20.30 | | 31.60 | 39.01 | 31.62 | 31.36 | 27.57 |
| 14 | MPEG7_descriptors | 21.25 | 22.19 | 32.26 | 31.38 | 38.54 | 32.36 | 37.25 | 27.11 |
| 15 | KTU_features | 29.22 | 30.11 | 29.53 | 31.64 | 43.11 | 29.10 | 33.22 | 29.13 |
| 16 | jAudio_features | 30.59 | 31.34 | 35.89 | 29.59 | 33.92 | | 29.53 | 28.50 |
| 17 | YAAFE_features | 23.61 | 29.67 | 27.17 | 27.66 | 35.99 | 29.03 | 28.43 | 25.57 |
| 18 | Tsanas | 29.16 | 30.09 | 31.18 | 26.92 | | 27.52 | 30.91 | 26.70 |
Notes. Microphone: AC = acoustic cardioid, SP = smart phone. Modality: P = phonation, S = speech, V = voiced part of speech, U = unvoiced part of speech.
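
The EER reported in this table is the operating point at which the miss rate equals the false-alarm rate. A minimal sketch of one common way to estimate it from a finite set of scores follows; the threshold sweep and the averaging of the two rates at the closest crossing are choices made for this example, and the function name `equal_error_rate` is hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) by sweeping a decision threshold.

    A recording is flagged as PD when its score is >= the threshold.
    """
    best_gap, best_eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss rate: PD recordings scored below the threshold.
        fnr = sum(s < thr for s in target_scores) / len(target_scores)
        # False-alarm rate: HC recordings scored at or above the threshold.
        fpr = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(fnr - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fnr + fpr) / 2
    return best_eer

# Toy scores: one PD recording and one HC recording are on the wrong side.
eer = equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.3])
print(f"EER = {100 * eer:.2f}%")
```

With few recordings the two error rates change in steps, so the crossing is approximate; interpolating along the DET curve is a common refinement.
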
Fig 1. OOB detection performance by the DET curves.
Microphone: AC (left) and SP (right). The best individual feature set (and corresponding modality): Essentia (P, S, V) and IS13_ComParE (U) using AC; Tsanas (P), jAudio (S), IS11_speaker_state (V) and YAAFE (U) using SP. Multimodal fusion (F) of all individual feature sets from all modalities.
Performance measures for 4 unimodal and 6 multimodal decision-level fusions.
| Fusion | AC: Cllr | AC: EER, % | SP: Cllr | SP: EER, % |
|---|---|---|---|---|
| P | 0.583 (0.004) | 21.05 (0.16) | 0.804 (0.004) | 32.81 (0.26) |
| S | 0.578 (0.006) | 21.96 (0.22) | 0.660 (0.007) | 25.33 (0.28) |
| V | 0.576 (0.004) | 25.09 (0.50) | 0.739 (0.005) | 25.96 (0.25) |
| U | 0.660 (0.007) | 26.36 (0.55) | 0.672 (0.004) | 25.21 (0.42) |
| P+S | 0.585 (0.004) | 21.09 (0.22) | 0.676 (0.006) | 23.90 (0.38) |
| S+V | 0.579 (0.004) | 22.55 (0.24) | 0.686 (0.005) | 23.58 (0.38) |
| S+U | 0.566 (0.006) | 22.32 (0.26) | | 25.36 (0.35) |
| V+U | 0.567 (0.005) | 24.73 (0.34) | 0.697 (0.007) | 24.48 (0.65) |
| S+V+U | | 23.08 (0.39) | 0.660 (0.007) | 25.00 (0.49) |
| P+S+V+U | 0.563 (0.004) | 19.27 | 0.652 (0.006) | 23.00 |
Notes. Fusion was repeated 99 times to estimate the mean (standard deviation). Microphone: AC = acoustic cardioid, SP = smart phone. Modality: P = phonation, S = speech, V = voiced part of speech, U = unvoiced part of speech.
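
Decision-level fusion with a random forest can be sketched as a two-stage scheme: one RF per feature set produces out-of-bag scores, and a second (meta) RF is trained on those scores. The code below is a sketch under stated assumptions rather than the authors' pipeline: it uses scikit-learn, synthetic data, 200 trees, and hypothetical helper names (`oob_scores`, `fuse`); only the class sizes (35 HC, 64 PD subjects) are taken from the database summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_scores(X, y, seed):
    """Fit an RF and return the out-of-bag probability of class 1 (PD)."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=seed)
    rf.fit(X, y)
    return rf.oob_decision_function_[:, 1]

def fuse(feature_sets, y, seed=0):
    """Stack per-feature-set OOB scores and score them with a meta-RF."""
    base = np.column_stack([oob_scores(X, y, seed + i)
                            for i, X in enumerate(feature_sets)])
    fused = oob_scores(base, y, seed + len(feature_sets))
    return fused, base

# Synthetic stand-ins for three feature sets (assumption: 0 = HC, 1 = PD).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], [35, 64])
feature_sets = [rng.normal(size=(99, d)) + 0.5 * y[:, None]
                for d in (10, 20, 30)]

fused, base = fuse(feature_sets, y)
print(base.shape, fused.shape)
```

Because every score fed to the meta-RF is out-of-bag, no recording is scored by trees that saw it during training, which is what makes OOB estimates usable without a separate hold-out set.
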
Fig 2. Visualization of the meta-RF proximity matrix by t-SNE.
Microphone: AC (left) and SP (right). Recording from: PD (designated by a red square □) and HC subject (designated by a blue circle ○).
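
A visualization of this kind can be approximated as follows: the proximity of two recordings is taken as the fraction of trees in which they fall into the same leaf, and one minus the proximity is passed to t-SNE as a precomputed distance. This is a sketch assuming scikit-learn, synthetic data in place of the meta-RF inputs, and arbitrary settings (100 trees, perplexity 20); the helper `rf_proximity` is a name introduced here, and proximities are computed over all trees rather than out-of-bag trees only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE

def rf_proximity(rf, X):
    """Fraction of trees in which each pair of samples shares a leaf."""
    leaves = rf.apply(X)  # shape (n_samples, n_trees), leaf index per tree
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)

# Synthetic stand-in data (assumption: 0 = HC, 1 = PD; 35 and 64 subjects).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], [35, 64])
X = rng.normal(size=(99, 5)) + y[:, None]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
prox = rf_proximity(rf, X)

# t-SNE on a precomputed distance matrix requires a random initialization.
embedding = TSNE(n_components=2, metric="precomputed", init="random",
                 perplexity=20, random_state=0).fit_transform(1.0 - prox)
print(embedding.shape)
```

Each row of `embedding` is a 2D point that can be drawn with a class-specific marker, so recordings the forest treats as similar end up close together on the map.
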