Jia Fu1, Sen Yang1, Fei He1, Ling He2, Yuanyuan Li3, Jing Zhang1, Xi Xiong4.
Abstract
BACKGROUND: Schizophrenia is a chronic and severe mental disease that substantially affects the daily life and work of patients. Clinically, schizophrenia with negative symptoms is often misdiagnosed, and the diagnosis depends on the experience of clinicians. An objective and effective method for diagnosing schizophrenia with negative symptoms is therefore urgently needed. Recent studies have shown that impaired speech can serve as an indicator for diagnosing schizophrenia. Previous work on schizophrenic speech detection has relied mainly on feature engineering, in which effective feature extraction is difficult because of the variability of speech signals.
Keywords: Attention mechanism; Deep learning; Pathological speech detection; Schizophrenia; Skip connection
Year: 2021 PMID: 34344372 PMCID: PMC8336375 DOI: 10.1186/s12938-021-00915-2
Source DB: PubMed Journal: Biomed Eng Online ISSN: 1475-925X Impact factor: 2.819
Text for speech recording in Mandarin and its corresponding English translation
| Emotion | Text (Mandarin) | Text (English) |
|---|---|---|
| Calm | Ta yi nian si ji dou ke yi kai hua, hua duo yi ban shi hong se huo fen se de. | It can bloom all year round, and the flowers are generally red or pink. |
| Anger | Gen ni shuo le duo shao ci le, bu xu wan wo de wan! Kan ba, wan bei da sui le! Ni zhen de shi yao qi si wo! | I told you so many times that you are not allowed to play with my bowls! Look, the bowl is shattered! You are really mad at me! |
| Fear | Ma ma, dui bu qi, wo...wo...wo bu shi gu yi de. | Mom, I’m sorry, I...I...I didn’t mean it! |
| Happiness | Ha ha, tai hao la! Tai hao la! Ma ma, ma ma, wo kao le 98 fen! | Awesome, it’s awesome! Mom, Mom, I got 98 points! |
Sch-net architecture details
| Layer | Dimension |
|---|---|
| Conv1 | 2×[3×3(64 filters)] |
| Conv2 | 2×[3×3(128 filters)] |
| Conv3 | 2×[3×3(256 filters)] |
| Conv4 | 2×[3×3(512 filters)] |
| Conv5-8 | 3×3(512 filters) |
| Max-pooling | 2×2 |
| Average-pooling | 2×2 |
| CBAM | 7×7 (2048 filters) |
| FC | 1×1×512, 1×1×2 (two hidden layers) |
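To get a feel for the size of the convolutional stack listed above, the following is a minimal sketch that counts weights and biases per block. It is not the authors' implementation: it assumes a single-channel spectrogram input, one bias per filter, and that "Conv5-8" denotes four single 3×3 layers. The names `conv_params`, `BLOCKS`, and `count_backbone_params` are hypothetical, and the skip connections, CBAM, pooling, and FC layers are ignored.

```python
# Hypothetical helper: parameter count of the plain convolutional stack in the
# table, assuming a 1-channel spectrogram input and one bias per filter.

def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

# (block name, number of stacked 3x3 conv layers, filters per layer), read from
# the table. "Conv5-8" is interpreted as four single 3x3 layers (an assumption).
BLOCKS = [
    ("Conv1", 2, 64),
    ("Conv2", 2, 128),
    ("Conv3", 2, 256),
    ("Conv4", 2, 512),
    ("Conv5-8", 4, 512),
]

def count_backbone_params(in_channels=1):
    """Return a dict mapping each block name to its parameter count."""
    per_block = {}
    c_in = in_channels
    for name, n_layers, filters in BLOCKS:
        total = 0
        for _ in range(n_layers):
            total += conv_params(3, c_in, filters)
            c_in = filters  # the next layer consumes this layer's output channels
        per_block[name] = total
    return per_block

if __name__ == "__main__":
    counts = count_backbone_params()
    for name, n in counts.items():
        print(f"{name}: {n:,}")
    print(f"total: {sum(counts.values()):,}")
```

Under these assumptions most parameters sit in the 512-filter layers, so a different input channel count changes only the first layer's contribution.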
Overall performance of schizophrenic speech detection using Sch-net and its components (backbone, skip connection (SC), and CBAM)
| Evaluated indicator (95% CI) | Backbone | Backbone + SC | Backbone + CBAM | Sch-net (ours) |
|---|---|---|---|---|
| Accuracy | 0.9323 (0.9295, 0.9351) | 0.9494 (0.9460, 0.9528) | 0.9563 (0.9534, 0.9591) | 0.9768 (0.9739, 0.9797) |
| Precision | 0.9480 (0.9445, 0.9515) | 0.9634 (0.9564, 0.9704) | 0.9513 (0.9458, 0.9568) | 0.9639 (0.9585, 0.9693) |
| Recall | 0.9149 (0.9100, 0.9197) | 0.9348 (0.9326, 0.9370) | 0.9622 (0.9556, 0.9688) | 0.9908 (0.9898, 0.9918) |
| F1-score | 0.9311 (0.9280, 0.9341) | 0.9487 (0.9456, 0.9519) | 0.9565 (0.9536, 0.9594) | 0.9771 (0.9743, 0.9799) |
| Sensitivity | 0.9176 (0.9131, 0.9221) | 0.9619 (0.9581, 0.9657) | 0.9902 (0.9847, 0.9956) | 0.9914 (0.9863, 0.9964) |
| Specificity | 0.9488 (0.9415, 0.9561) | 0.9601 (0.9513, 0.9689) | 0.9494 (0.9437, 0.9551) | 0.9738 (0.9656, 0.9820) |
| AUC | 0.9593 (0.9577, 0.9609) | 0.9892 (0.9859, 0.9924) | 0.9902 (0.9880, 0.9924) | 0.9978 (0.9965, 0.9990) |
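The indicators in the table above follow the usual confusion-matrix definitions. The sketch below shows how they could be computed and how a 95% CI over repeated runs might be obtained. The source does not state how its intervals were derived, so the normal-approximation interval here is an assumption, as are the function names and the example counts.

```python
import math
import statistics

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification indicators from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard for empty classes).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity by definition
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def mean_ci95(values):
    """Mean and normal-approximation 95% CI of the mean over repeated runs.

    Assumption: the paper may have used a different interval estimator.
    """
    m = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return m, (m - half, m + half)

if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    print(classification_metrics(tp=95, fp=5, tn=90, fn=10))
    # Hypothetical per-run accuracies, for illustration only.
    print(mean_ci95([0.970, 0.980, 0.975]))
    # Consistency check against the reported Sch-net precision and recall.
    print(round(f1_from(0.9639, 0.9908), 4))
```

As a sanity check, plugging the reported Sch-net precision and recall into `f1_from` gives a value that agrees with the reported F1-score to about three decimal places.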
Fig. 1 Box plots of accuracy for classifying schizophrenic speech and controls using Sch-net and its components (backbone, skip connection (SC), and CBAM)
Performance of feature engineering and classifiers on schizophrenic speech detection
| Classifier | Feature | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Time-domain feature | FFT-based spectral feature | Auditory-based spectral feature | Spectral envelope feature | ||||||||
| STE | Pitch | Fluency feature | LTAS | Spectrogram | MFCC | GTCC | LP | SWLP | XLP | ||
| RF | Accuracy | 0.7686 | 0.5935 | 0.6464 | 0.8043 | 0.9245 | 0.9377 | ||||
| Precision | 0.6251 | 0.5847 | 0.6052 | 0.7818 | 0.9055 | 0.9282 | |||||
| Recall | 0.7126 | 0.5754 | 0.5545 | 0.8577 | 0.9549 | 0.9466 | |||||
| F1-score | 0.6306 | 0.6322 | 0.5513 | 0.8144 | 0.9280 | 0.9391 | |||||
| KNN | Accuracy | 0.5385 | 0.7390 | 0.7504 | 0.8626 | 0.9204 | 0.9287 | ||||
| Precision | 0.6410 | 0.5308 | 0.6489 | 0.8977 | 0.8939 | 0.9117 | |||||
| Recall | 0.6063 | 0.5597 | 0.6152 | 0.8312 | 0.9549 | 0.9636 | |||||
| F1-score | 0.6123 | 0.5382 | 0.6211 | 0.8591 | 0.9257 | 0.9315 | |||||
| SVM | Accuracy | 0.5172 | 0.7746 | 0.7358 | 0.8625 | 0.9164 | 0.9291 | ||||
| Precision | 0.6447 | 0.5087 | 0.6556 | 0.8555 | 0.8980 | 0.9126 | |||||
| Recall | 0.5999 | 0.4767 | 0.5435 | 0.8929 | 0.9470 | 0.9466 | |||||
| F1-score | 0.6155 | 0.4644 | 0.5813 | 0.8689 | 0.9206 | 0.9317 | |||||
| LDA | Accuracy | 0.5087 | 0.7452 | 0.7314 | 0.8887 | 0.9026 | 0.9069 | ||||
| Precision | 0.6447 | 0.4625 | 0.6622 | 0.9394 | 0.8963 | 0.8868 | |||||
| Recall | 0.5898 | 0.5391 | 0.5380 | 0.8474 | 0.9198 | 0.9111 | |||||
| F1-score | 0.6093 | 0.4710 | 0.5821 | 0.8807 | 0.9060 | 0.9079 | |||||
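As an illustration of the simplest hand-crafted feature named in the table, the sketch below computes short-time energy (STE) as the sum of squared samples per frame. The frame length, hop size, and 16 kHz sampling rate are assumptions chosen for the example, not values taken from the source, and `short_time_energy` is a hypothetical name.

```python
import math

def short_time_energy(signal, frame_len=400, hop=160):
    """Sum of squared samples in each frame (rectangular window).

    Defaults correspond to 25 ms frames with a 10 ms hop at 16 kHz, which is
    an assumed configuration. Trailing samples that do not fill a full frame
    are dropped.
    """
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies

if __name__ == "__main__":
    sr = 16000  # assumed sampling rate
    # Synthetic signal: 0.1 s of silence followed by 0.1 s of a 220 Hz tone.
    silence = [0.0] * 1600
    tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(1600)]
    ste = short_time_energy(silence + tone)
    print(len(ste), ste[0], round(ste[-1], 2))
```

A per-frame sequence like this would typically be summarized (for example by its mean and variance) before being passed to a classifier such as RF, KNN, SVM, or LDA.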
Performance of schizophrenic speech detection using classic deep neural networks and the proposed Sch-net
| Evaluated indicators | 95% CI | |||||
|---|---|---|---|---|---|---|
| AlexNet | VGG16 | ResNet34 | DenseNet121 | Xception | Sch-net (ours) | |
| Accuracy | 0.9272 | 0.9247 | 0.9439 | 0.9469 | 0.9503 | |
| (0.9249,0.9295) | (0.9225,0.9269) | (0.9398,0.9480) | (0.9449,0.9489) | (0.9482,0.9524) | ||
| Precision | 0.9279 | 0.8937 | 0.9074 | 0.9555 | 0.9462 | |
| (0.9226,0.9333) | (0.8900,0.8973) | (0.9024,0.9124) | (0.9516,0.9594) | (0.9421,0.9503) | ||
| Recall | 0.9268 | 0.9643 | 0.9890 | 0.9375 | 0.9551 | |
| (0.9251,0.9285) | (0.9643,0.9643) | (0.9822,0.9958) | (0.9375,0.9375) | (0.9545,0.9556) | ||
| F1-score | 0.9273 | 0.9276 | 0.9463 | 0.9464 | 0.9506 | |
| (0.9252,0.9293) | (0.9257,0.9295) | (0.9423,0.9503) | (0.9445,0.9483) | (0.9486,0.9526) | ||
| Sensitivity | 0.6399 | 0.8795 | 0.9798 | 0.9262 | 0.9482 | |
| (0.6244,0.6554) | (0.8715,0.8875) | (0.9725,0.9870) | (0.9244,0.9280) | (0.9409,0.9556) | ||
| Specificity | 0.8747 | 0.9268 | 0.9399 | 0.9646 | 0.9738 | |
| (0.8564,0.893) | (0.9164,0.9372) | (0.9331,0.9467) | (0.9567,0.9724) | (0.9656,0.9820) | ||
| AUC | 0.7935 | 0.9447 | 0.9888 | 0.9908 | 0.9924 | |
| (0.7868,0.8003) | (0.9422,0.9472) | (0.9855,0.9921) | (0.9899,0.9917) | (0.9912,0.9936) | ||
Fig. 2 Box plots of accuracy for classifying schizophrenic speech and controls using five neural networks and Sch-net
Fig. 3 Spectrograms and corresponding activation maps of normal speech and schizophrenic speech in four emotions: (a) calm, (b) anger, (c) fear, (d) happiness
Results of SLI detection using state-of-the-art methods, classic deep neural networks and the proposed Sch-net
| Category | Method | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| State-of-the-art method | Grill [ | 0.9694 | 1.0000 | 0.9474 | 0.9730 |
| | Ramarao [ | 0.9941 | | | |
| | Slogrove [ | 0.9800 | 0.9900 | 0.9900 | 0.9900 |
| | Sharma [ | 0.9790 | - | - | |
| | Reddy [ | 0.9882 | | | |
| Deep Neural Network | AlexNet | 0.9132 | 0.9585 | 0.8810 | 0.9181 |
| | VGG16 | 0.9230 | 0.9897 | 0.8787 | 0.9309 |
| | ResNet34 | 0.9329 | 0.9489 | 0.9286 | 0.9386 |
| | DenseNet121 | 0.9461 | 0.9397 | 0.9643 | 0.9518 |
| | Xception | 0.9622 | 0.9514 | 0.9863 | 0.9685 |
| | Sch-net (ours) | 0.9952 | 0.9979 | 0.9937 | 0.9958 |
Fig. 4 Architecture of Sch-net for automatic schizophrenic speech detection
Fig. 5 Diagram of the backbone network of Sch-net
Fig. 6 Diagram of the backbone network + skip connections of Sch-net