A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech.

Laszlo Toth¹, Ildiko Hoffmann², Gabor Gosztolya¹, Veronika Vincze¹, Greta Szatloczki³, Zoltan Banreti⁴, Magdolna Pakaski³, Janos Kalman³.

Abstract

BACKGROUND: Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during a memory task. In the future, this can form the basis of Internet-based interactive screening software for the recognition of MCI.
METHODS: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. We provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and to answer one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features.
RESULTS: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process - that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%.
CONCLUSION: The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community.

Keywords: Mild cognitive impairment; acoustic analysis; diagnosis; machine learning; speech recognition; spontaneous speech; temporal features

Year: 2018 PMID: 29165085 PMCID: PMC5815089 DOI: 10.2174/1567205014666171121114930

Source DB: PubMed Journal: Curr Alzheimer Res ISSN: 1567-2050 Impact factor: 3.498

Introduction

Alzheimer’s disease (AD) is a quite distinct neurodegenerative disorder that develops for years before clinical manifestation. Although it has been extensively researched, uncertainty regarding the diagnosis of its prodromal stages still exists. However, the symptoms of mild cognitive impairment (MCI) might be detected years before the actual diagnosis [1]. This tells us that the clinical appearance of AD is preceded by a prolonged, preclinical phase. Hence, early diagnosis and timely treatment are very important, as the progression can be slowed down and the occurrence of new symptoms can be delayed [2].

MCI is a heterogeneous syndrome that has a clinical importance in the early detection of both AD [3] and the prodromal state of dementia. MCI often remains undiagnosed, as recognizing cognitive impairment is challenging for clinicians at any stage of the disease: up to 50% of even later stage dementia fails to be recognized [4]. Widely used screening tests such as the Mini-Mental State Examination (MMSE) are not sensitive enough to reliably detect the subtle impairments present in patients with early-stage MCI. Linguistic memory tests like word list and narrative recall are more effective in the detection of MCI, but they tend to produce undesired false positive diagnoses [5].

MCI is known to influence the linguistic abilities of the patient in the following aspects: temporal changes in spontaneous speech [5-11], phonemic paraphasia [7, 9, 10, 12-14], word retrieval and word finding difficulties [11, 15-20] and verbal fluency difficulties [5, 10, 17, 21-25]. Although language impairment has been known to occur early in the disease process [26], only minimal attention has been paid to a formal language evaluation when diagnosing AD [27]. Since language impairment has been reported even in the mild stage of AD, we recently developed a sensitive neuropsychological screening method that is based on a memory task, triggered by spontaneous speech [7]. 
In the future, this approach might form the basis of screening for MCI through a computerized, interactive test implemented as a software package [28]. MCI is known to influence the (spontaneous) speech of the patient in three main respects [11]. Firstly, the verbal fluency of the patient deteriorates, which results in distinctive acoustic changes, most importantly in longer hesitations and a lower speech rate [9, 10]. Secondly, as the patient has trouble finding the right word, the lexical frequency of words and part-of-speech tags may also change significantly [19, 20, 29]. Thirdly, the emotional responsiveness of the patient was also reported to change in many cases. There are attempts to detect these changes based on the paralinguistic and prosodic features of the patient’s speech [30-34].

The goal of the study we present here was twofold. First, we sought to find acoustic (temporal) parameters that have a high correlation with MCI. For this, our starting point was our earlier study, where we examined the speech of AD patients [7]. There we compared the articulation rate, speech tempo, hesitation ratio, and rate of grammatical errors of AD patients versus a normal control group. Our results showed that these acoustic parameters may have a diagnostic value for mild-stage AD and thus can be viewed as acoustic biomarkers of AD. Here, we perform the statistical evaluation of a similar set of temporal features, but in this case on the speech recordings of MCI patients.

The manual extraction and analysis of the temporal features can be very informative regarding the acoustic correlates of MCI. However, performing the same analysis for each new patient for diagnostic purposes would be very time consuming. Hence, finding ways of automating the feature extraction process would be vital. Furthermore, if the diagnostic decision process could be automated as well, then the corresponding algorithms could form the basis of a fully automatic screening test. 
The simplest way of automating the feature extraction procedure is to apply signal processing methods. With these, one can separate voiced and voiceless segments, speech and non-speech, or periodic-aperiodic parts [9, 30]. However, the precise measurement of temporal features such as articulation rate would require the use of automatic speech recognition (ASR) tools. Several researchers have already used ASR in the detection of dementia or MCI, but they usually applied off-the-shelf ASR solutions [5, 10, 19, 35]. In this study, we present an ASR system that was especially tailor-made for detecting our acoustic biomarkers. The most important difference is that, instead of minimizing the word error rate, our system focuses on finding the phones. Even more unusually, our biomarker extraction step requires the ASR only to find the phone boundaries, as the actual phone labels are not required by the feature extraction procedure. A further difference compared to standard ASR solutions is that finding and identifying filled pauses is important for us, while these segments are usually filtered out from a standard ASR output as garbage. More details and explanation will be given in Sections 3 and 4. The second step of automation is to obtain a diagnostic decision algorithmically. For this purpose, we applied machine learning technologies. Based on the manual labeling (yes/no diagnosis) of our MCI and control patients, we trained the Weka machine learning toolkit [36] to distinguish the two classes. We performed this classification experiment with both the manually and the automatically extracted features, in order to see how the two feature sets compare. The structure of this article is as follows. First, we present the subjects and the data acquisition method in Section 2. We introduce our set of temporal features and perform statistical analysis on them in Section 3. 
In Section 4 we present our method for the automation of feature extraction using ASR, and in Section 5 we attempt to automatically classify MCI cases using machine learning methods. Lastly, in Section 6, we make some concluding remarks.

Subjects and Data Collection

A total of 84 subjects participated in the study. Of these, 48 were MCI patients and 36 were healthy controls. All 48 subjects with MCI were right-handed native speakers of Hungarian with a negative medical history for hearing impairment. They were also free of any medication that might influence cognitive functions. The healthy control (NC) group included 36 participants, who were also right-handed native speakers of Hungarian. The NC group did not differ significantly from the MCI group in either gender (χ²-test, p = 0.791) or years of education (t-test, p = 0.0807). As regards age, the MCI group differed significantly from the healthy controls (t-test, p = 0.0322). That is, subjects older than 71 years were more likely to belong to the MCI group. The following clinical tests were applied to assess the cognitive state of the subjects: Clock Drawing Test [37], Mini-Mental State Examination [38], ADAS-Cog [39]. The state of depression was assessed using the Geriatric Depression Scale. The main statistical properties of the MCI and the control group are summarized in Tables 1a and 1b. All the tests were carried out at the Alzheimer Disease Research Centre of the University of Szeged, Hungary. The research was approved by the Ethics Committee of the University of Szeged, and all experiments were conducted in accordance with the Declaration of Helsinki.

The linguistic protocol for collecting the speech samples from the subjects consisted of the following steps. After presenting a specially designed one-minute-long animated movie, the subjects were requested to talk about the events they saw in the film (Task 1 - “immediate recall”). After this, we asked the subjects to describe their previous day (Task 2 - “spontaneous speech”). Finally, as the last task, a second film was presented, and the subjects were asked to talk about this film after a one-minute-long distraction (Task 3 - “delayed recall”). 
This way, we obtained three recordings from each subject, corresponding to the three different tasks. The recording was performed with an Olympus WS-750M digital voice recorder and a König tie clip microphone. The sound samples were recorded in wma format, which was later converted into an uncompressed PCM mono, 16-bit wav format with a sampling rate of 16000 Hz.

Manually Extracted Temporal Features

After careful listening, the recordings were transcribed both orthographically and phonetically. The manual analysis of the signals was carried out with the help of the Praat software (www.praat.org, [40]). The acoustic parameters we examined are the temporal variables listed in Table . We note that hesitation was defined as the absence of speech for more than 30 ms [41]. Notice that while the first three features focus on the phonetic content of the signals, the remaining five features try to extract information about the pauses found in the recordings. However, hesitations do not necessarily mean silent pauses, but can also correspond to filled pauses where the speaker hums, or produces other hesitating sounds (uhm, er, etc.). Hence, we decided to calculate the last five features for silent and filled pauses separately. Furthermore, we also calculated them with both types of pauses taken into account. These five features thus resulted in fifteen values, so together with the first three features we extracted eighteen features from each utterance. It should be mentioned that, technically speaking, the calculation of the acoustic parameters requires the precise placement of the phonetic segment boundaries, and the identification of each segment as pause, filled pause or phoneme. Hence, the manual processing step using Praat basically consisted of the phonetic segmentation and labeling of the utterances. We will try to replace this manual process by speech recognition methods in Section 4.
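
The dependence of the features on the segmentation can be illustrated with a minimal sketch. It assumes the segmentation is available as (label, start, end) tuples, with the hypothetical labels "sil" for silent pauses and "fp" for filled pauses; every other label is treated as a phone. The feature definitions below (speech rate as phones per second of total duration, articulation rate as phones per second of pause-free speech) are common conventions and may differ in detail from the definitions in the feature table. Only four pause features are sketched, so the output is not the full eighteen-value set.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[str, float, float]   # (label, start in s, end in s)

SILENT, FILLED = "sil", "fp"         # hypothetical label names
MIN_PAUSE_S = 0.030                  # hesitation threshold from the text (30 ms)


def pause_stats(segments: List[Segment], labels: Set[str], total: float) -> Dict[str, float]:
    """Count, total length, ratio to duration and mean length of the selected pauses."""
    lengths = [end - start for label, start, end in segments
               if label in labels and end - start > MIN_PAUSE_S]
    length = sum(lengths)
    return {
        "count": len(lengths),
        "total_length": length,
        "ratio": length / total if total > 0 else 0.0,
        "mean_length": length / len(lengths) if lengths else 0.0,
    }


def temporal_features(segments: List[Segment]) -> Dict[str, float]:
    """Derive duration, rate and pause features from one labeled segmentation."""
    total = segments[-1][2] - segments[0][1]
    n_phones = sum(1 for label, _, _ in segments if label not in (SILENT, FILLED))
    groups = {"silent": {SILENT}, "filled": {FILLED}, "all": {SILENT, FILLED}}
    stats = {name: pause_stats(segments, labels, total) for name, labels in groups.items()}
    # Pauses at or below the threshold are not counted and stay part of speaking time.
    speaking_time = total - stats["all"]["total_length"]
    features = {
        "duration": total,
        "speech_rate": n_phones / total if total > 0 else 0.0,
        "articulation_rate": n_phones / speaking_time if speaking_time > 0 else 0.0,
    }
    for name, values in stats.items():
        for key, value in values.items():
            features[f"{name}_{key}"] = value
    return features


# Toy segmentation (invented values, not study data).
example = [("sil", 0.0, 0.5), ("a", 0.5, 0.6), ("b", 0.6, 0.7),
           ("fp", 0.7, 1.0), ("c", 1.0, 1.1), ("sil", 1.1, 1.12)]
feats = temporal_features(example)
```

Note that the phone labels themselves are never inspected beyond the pause/non-pause distinction, which is why a segmentation with unreliable phone identities can still yield usable features.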

Statistical Analysis of the Features

In order to investigate the importance of each feature, we carried out a statistical analysis. We applied the t-test to each temporal feature for each task and compared the data from the MCI group to those obtained from the NC group. The t-test applied was a one-tailed t-test for unpaired samples and unequal variances. Table shows the p-values for each parameter and for each of the three tasks. The cases where the difference between the two groups is significant (i.e. where p < 0.05) are marked in bold. Our analysis found that duration, the total length of silent pauses and the (total) length of pauses indicated significant differences between controls and MCI patients for all three tasks. As for articulation rate, speech rate and the ratio of pauses and duration, there were significant differences between controls and MCI patients in tasks 1 and 3 but not in task 2. The number of silent pauses and pauses differed significantly in tasks 1 and 2, whereas the total length of filled pauses showed significant differences in tasks 2 and 3. There were also certain parameters that were significant only in the case of one of the tasks. Namely, the number of filled pauses and the ratio of filled pauses and duration showed significant differences only in task 2, while the ratio of silent pauses and duration, the average length of silent pauses and the average length of pauses were significant only in task 3.
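
For readers who wish to reproduce this kind of test, the following is a minimal sketch of a one-tailed t-test for unpaired samples with unequal variances (Welch's test), written with the Python standard library only. The function names are hypothetical, the tail probability is obtained by simple numerical integration rather than by a statistics package, and the direction of the alternative hypothesis (first group larger) is an assumption that has to be chosen per feature.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x: float, df: float) -> float:
    """Probability density of Student's t distribution."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t: float, df: float, steps: int = 2000) -> float:
    """P(T > t), using Simpson's rule on [0, |t|] and the symmetry of the density."""
    h = abs(t) / steps
    area = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


def one_tailed_welch(a: Sequence[float], b: Sequence[float]) -> float:
    """p-value for the alternative hypothesis mean(a) > mean(b)."""
    t, df = welch_t(a, b)
    return t_upper_tail(t, df)


# Invented pause counts for illustration only (not study data).
p_value = one_tailed_welch([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
```

A feature would be flagged as significant when the returned p-value falls below 0.05, matching the criterion used in the text.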

Automatic Extraction of Acoustic Biomarkers

In the last section we saw that the majority of the temporal features examined correlate with the diagnosis of MCI, so they might be regarded as acoustic biomarkers of MCI. However, extracting these biomarkers manually is rather tedious and requires skilled labor, as their calculation is based on the phonetic segmentation of the recordings. In this section we present our special, speech recognition-based solution for the automatic extraction of the relevant acoustic features. But before doing this, we will give a brief overview of the related studies found in the literature.

Related Work

The simplest way of automating the feature extraction procedure is to apply signal processing methods. For example, signal processing algorithms can be used for the separation of voiced and voiceless segments, for finding the silent sections of a recording, or for the extraction of prosodic features. Satt et al. derived continuity features from the durations of contiguous voice and silence segments, and from the lengths of periodic and aperiodic segments. They separated the voiced and the silent segments using a simple voice activity detection algorithm, based on the pitch-synchronous intensity curve of the recorded speech signal. The periodic and aperiodic segments were detected based on the pitch contour. They calculated these features using the Praat software, and intentionally avoided speech recognition tools in order to keep their system language-independent [9]. De Ipiña et al. also applied Praat to cut the acoustic signal into voiced and voiceless segments. They applied an automatic voice activity detector for this purpose, and they represented the segments by computing statistics of their duration, short-term energy and the location of spectral centroids [30]. Rapcan et al. separated the speech and non-speech parts of the input signal using relatively simple signal processing methods. They applied a thresholding method to the energy contour of the speech signal, and also applied an edge detection algorithm to find the boundaries of breath sounds. An important novelty of this paper is that the authors recognized that elongated breaths may correspond to filled pauses, and thus they may be discriminative features of cognitive function. Based on this observation, Rapcan et al. took special care to separate breath sounds from speech [42]. The advantage of the above-mentioned signal processing methods is that they are relatively simple, and also largely language-independent. 
However, extracting features like periodicity/aperiodicity can give only a rough estimate of temporal features such as the rate of speech. Moreover, as Fraser et al. correctly note, these simple techniques cannot distinguish filled pauses from speech, so they will find only the silent pauses [43]. An alternative option is to involve automatic speech recognition (ASR) tools in the process of feature extraction. This approach is language-dependent, but allows the precise calculation of phone duration-based measures. Baldas et al. proposed the use of ASR for transcribing the speech signal, and extracting lexical features from the transcripts that may help the early detection of AD. However, they did not apply the ASR to extract acoustic features, and their short conference paper contains no experimental results [29]. Lehr et al. also applied ASR in the detection of MCI, but similar to Baldas et al., they employed the ASR only to obtain an orthographic transcript, and not for the purpose of extracting acoustic features. We note that their ASR system produced relatively large word error rates (between about 30% and 50%), and the error rates were significantly higher for the MCI group than for the control group [35]. Fraser et al. applied the off-the-shelf ASR tool of Nuance in the diagnosis of primary progressive aphasia. They argued that the accuracy of ASR systems for elderly voices usually decreases with the age of the speaker, and that speech recognition can be even more challenging in the presence of linguistic impairments [43]. Roark et al. extracted temporal features that are very similar to the ones we propose here, for example, the pause rate and the phonation rate. They also made use of an ASR tool in the extraction of the features. However, in their case the transcript is obtained manually, and ASR is applied only for creating a forced time alignment. Hence, their approach allows only the automation of narrative recall tests in the best case [5]. 
The closest study to ours is that of Jarrold et al., who applied ASR to extract both acoustic and lexical features. Their acoustic feature set consisted of duration-based measures like the duration of consonants, vowels, pauses, and other acoustic-phonetic categories. However, they did not pay special attention to handling breaths or any other types of filled pauses [10]. Here, in contrast, we apply a dedicated ASR tool which was adapted to the special needs of extracting the acoustic features discussed in Section 3. We explain the special features of our ASR system below.
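
To make the signal processing approach surveyed above concrete, here is a minimal sketch of energy-based silence detection. It is not the implementation of any of the cited studies: the frame length, the energy threshold and the function names are all hypothetical choices, and the 30 ms minimum pause length simply mirrors the hesitation definition used earlier. As noted in the text, such a detector finds silent pauses only and cannot tell filled pauses from speech.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def silent_segments(samples: Sequence[float], sample_rate: int = 16000,
                    frame_ms: int = 10, threshold: float = 1e-4,
                    min_pause_ms: int = 30) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of low-energy runs longer than min_pause_ms."""
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_energies(samples, frame_len)
    segments, start = [], None
    # A sentinel with infinite energy closes a pause that runs to the end of the signal.
    for idx, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if start is None:
                start = idx
        elif start is not None:
            if (idx - start) * frame_ms > min_pause_ms:
                segments.append((start * frame_ms / 1000, idx * frame_ms / 1000))
            start = None
    return segments


# Synthetic signal: 100 ms of "speech", 50 ms of silence, 100 ms of "speech".
signal = [0.5] * 1600 + [0.0] * 800 + [0.5] * 1600
pauses = silent_segments(signal)
```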

Extracting the Features using a Dedicated ASR Tool

As we have seen, some authors have already attempted to use an ASR tool as part of the feature extraction process. However, using off-the-shelf ASR software (similar to that used by Fraser et al. [43]) may be suboptimal. The explanation is that standard speech recognizers are optimized to minimize the number of transcription errors at the level of words, while here we want to extract non-verbal acoustic features such as the rate of speech or the duration of pauses. Note, for example, that none of our acoustic features require the actual identity of the phones; instead, we need only to count them, and to measure their duration. Furthermore, the filled pauses do not appear explicitly in the output of a conventional ASR system, while for our purposes they convey an important piece of information. Also, our data collection method (Task 2 in particular) would require domain-independent or “open-domain” ASR software. Because of the highly agglutinative nature of Hungarian, such recognizers are scarce for Hungarian, and the few that exist (like Google's free ASR tool) work with a surprisingly large word error rate [44]. In addition to this, the speech of dementia patients has been observed to contain an increased amount of agrammatical sentences and incorrect word inflections [19, 43]. Preparing a standard ASR system to handle all these non-standard errors would require a statistical model of the language of people with dementia, for which our sample set was definitely too small. Moreover, similar studies that experimented with the automatic recognition of the speech of MCI patients reported quite large word error rates in the range of 30 to 50% [35]. For the reasons presented above, we decided to create a dedicated version of our speech recognizer that has been tailored to the special requirements of the task. Rather than producing a word-level transcript, it provides only a phone sequence as output, which includes filled pauses as a special ‘phone’. 
Of course, the recognition of spontaneous speech of elderly people is known to be relatively difficult [43, 45]. Doing this only at the phonetic level, that is, without restricting the vocabulary, obviously increases the number of recognition errors. However, as we already pointed out, many types of phone recognition errors do not harm the extraction of the acoustic indicators. Hence, the main question of our experiments was whether our acoustic features (and the subsequent classification step described later) tolerate the inaccuracies introduced by switching from the manual to the automatic extraction method.

The technical details of how we constructed our dedicated ASR system are as follows. To train the speech recognizer we used the BEA Hungarian Spoken Language Database [46]. This database contains spontaneous speech, similar to the recordings collected from the MCI patients. We used approximately seven hours of speech from the BEA corpus, mostly recordings from elderly persons, in order to match the age group of the targeted MCI patients. Although the BEA dataset contains spontaneous speech, its original annotation did not perfectly fit our needs. It contained the word-level transcript of the utterances, but the filled pauses and other non-verbal audio segments (coughs, laughter, sighs, breath intakes etc.) were either unmarked or improperly marked. Thus we adjusted the annotation of the recordings to our needs. This mainly consisted of extending the transcripts by annotating filled pauses, breath intakes and exhales, laughter, gasps and coughs in a consistent manner. We trained the ASR system to recognize the phones in the utterances, where the phone set was extended to include the special non-verbal labels mentioned above. For acoustic modeling we applied a special convolutional deep neural network (CNN) based technology. With this approach we achieved one of the lowest phone recognition error rates ever reported on the TIMIT database [47]. 
As a language model we applied a phone bigram, which is a very simple statistical model that estimates the probability of the next phone based on the current one. Naturally, these phone bigrams also included the non-verbal audio tags mentioned earlier. The ASR system outputs the phonetic segmentation and labeling of the input signal, including the filled pauses. Using this output, the acoustic biomarkers we introduced in Section 3 can be easily extracted by performing simple calculations. More technological details on our ASR solution can be found in our conference paper [48].
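
The idea of a phone bigram can be sketched in a few lines. The example below estimates P(next phone | current phone) from labeled phone sequences in which the filled pause is just another symbol. The label names ("sil", "fp"), the toy sequences and the add-alpha smoothing are assumptions made for illustration; the text does not specify how the bigram probabilities were estimated or smoothed in the actual system.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def train_phone_bigram(sequences: Iterable[List[str]],
                       alpha: float = 1.0) -> Dict[str, Dict[str, float]]:
    """Estimate P(next | current) from phone sequences with add-alpha smoothing."""
    counts = defaultdict(Counter)
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    model = {}
    for current in vocab:
        total = sum(counts[current].values()) + alpha * len(vocab)
        model[current] = {nxt: (counts[current][nxt] + alpha) / total for nxt in vocab}
    return model


# Toy training data; "fp" (filled pause) and "sil" are treated as ordinary symbols.
toy_sequences = [["sil", "h", "o", "fp", "sil"], ["h", "o", "h", "o"]]
bigram = train_phone_bigram(toy_sequences)
```

Because the non-verbal tags are part of the symbol set, the decoder can hypothesize a filled pause wherever the acoustics and the bigram context support it, instead of discarding it as garbage.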

Classifying MCI via Machine Learning

Our final goal is the development of a software application that would allow users to test themselves for MCI. Depending on the results of the test, the program would suggest that the subject visit a neurologist for a more detailed examination. To create such a piece of software, not only the feature extraction process but also the decision making step needs to be automated. We made the decision making procedure automatic using machine learning as follows. The values of the acoustic features are forwarded to a machine learning method, which classifies the patient as either having MCI or not. The manually extracted feature values were available for all the test files, so the classification results produced by the machine learning on this feature set were used as our baseline. We repeated the feature extraction step using the ASR tool, and we compared the resulting accuracy scores with the baseline.

For machine learning, we applied the Weka tool, which is a free and open-source collection of machine learning algorithms [36]. As our dataset was very small, we restricted ourselves to simpler classification methods, namely Naive Bayes, linear SVM and Random Forests. Naive Bayes is a natural choice in the case of very small datasets, as it has a very efficient modeling strategy to alleviate the so-called “curse of dimensionality” [49]. We chose SVM because it is one of the most popular classification methods in bioinformatics [50]. Finally, Random Forest is a relatively new classification algorithm which is also known to be robust in the case of limited data, but its modeling strategy is much more sophisticated than that of Naive Bayes [51, 52].

The technical details of training Weka for the MCI classification task are as follows. We treated the three utterances of each patient as one training sample, so altogether we had 84 training samples. 
The goal of training was to decide whether a given speaker has MCI or not, which results in a 2-class classification task. As for each subject we had three recordings collected from the three different tasks, by concatenating the eighteen biomarkers shown in Table we obtained 54 feature values per subject. From a machine learning perspective, this dataset is rather small. Unfortunately, the number of diagnosed MCI patients is quite limited, and it is tedious to collect recordings from them. All the similar studies we found worked with fewer than 100 patients [5, 9, 10, 35, 43]. Having so few examples, we did not form separate training and test sets, but applied the so-called leave-one-out method. This means that we withheld one example (i.e. one subject), trained our classifier on the remaining ones, and then evaluated the model on the withheld sample. We repeated this step for all the examples and finally aggregated the resulting accuracies into one final score.

From Weka, we used the implementations called NaiveBayes, SMO and RandomForest. We applied Naive Bayes with its default settings; for Random Forest we used 100 decision trees, while we varied the number of randomly chosen features (on which these trees were trained). We used SVM with a linear kernel, and varied the C complexity meta-parameter in the range of 10⁻⁵ to 10². The optimal meta-parameters for the Random Forest and SVM methods were found by iterating through these fixed sets and choosing the value that produced the highest F-measure.

The choice of evaluation metric is not an easy and clear-cut issue for this task. From a machine learning perspective, we can apply standard information retrieval metrics: precision measures what percentage of the MCI detections correspond to real MCI cases (true positives per true positives plus false positives), whereas recall tells us what percentage of the real MCI occurrences were detected (true positives per true positives plus false negatives). 
For those who prefer to express the performance of the classifier by just one number, the standard choice is the F-measure (or F1), which is the harmonic mean of precision and recall. Lastly, as in this case the class distribution is quite balanced, optimizing for accuracy (defined as the number of correctly classified examples over the total number of examples) makes sense as well. Readers coming from the field of biomedicine might be more familiar with the terms sensitivity (the same as recall) and specificity (true negatives per true negatives plus false positives), and might also be interested to see the ROC curve and the corresponding area under the curve (AUC) values. We list all these metrics in the tables, and also show ROC curves for the sake of completeness, but we stress that during training the parameters were chosen to optimize the F-score of the MCI class. Hence, the other metrics may not be perfectly optimal due to the well-known tradeoff between specificity and sensitivity (or precision and recall).
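
The leave-one-out protocol and the metrics defined above can be sketched as follows. This is not the Weka setup used in the study: a small Gaussian Naive Bayes written in plain Python stands in for the classifiers, the function names are hypothetical, and the toy data are invented. Label 1 is assumed to denote the MCI (positive) class.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def fit_gaussian_nb(X: List[List[float]], y: List[int]) -> Dict[int, tuple]:
    """Per-class prior plus per-feature mean and variance."""
    model = {}
    for label in set(y):
        cols = list(zip(*[x for x, lab in zip(X, y) if lab == label]))
        model[label] = (
            y.count(label) / len(y),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-9 for c in cols],   # small floor avoids division by zero
        )
    return model


def predict(model: Dict[int, tuple], x: List[float]) -> int:
    """Return the class with the highest log posterior."""
    def log_posterior(prior, mus, variances):
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, mus, variances))
    return max(model, key=lambda label: log_posterior(*model[label]))


def leave_one_out(X: List[List[float]], y: List[int]) -> Tuple[int, int, int, int]:
    """Withhold each subject once; return aggregated (tp, fp, tn, fn)."""
    tp = fp = tn = fn = 0
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = predict(model, X[i])
        if pred == 1:
            tp, fp = tp + (y[i] == 1), fp + (y[i] == 0)
        else:
            tn, fn = tn + (y[i] == 0), fn + (y[i] == 1)
    return tp, fp, tn, fn


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall (sensitivity), F1, specificity and accuracy."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1,
            "specificity": specificity, "accuracy": accuracy}


# Toy two-feature data (invented): three controls (0) and three MCI subjects (1).
X_toy = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [3.0, 4.0], [3.1, 4.2], [2.9, 3.9]]
y_toy = [0, 0, 0, 1, 1, 1]
scores = metrics(*leave_one_out(X_toy, y_toy))
```

Aggregating the confusion counts over all withheld subjects before computing the metrics, as done here, is one common way to obtain a single leave-one-out score; it is an assumption about how the final numbers were combined.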

Classification Results and Discussion

Table shows the classification accuracy measures obtained using all the features extracted. We compared the performance of three classifier algorithms (Naive Bayes, SVM and Random Forest) using the manually and the automatically extracted features. It can be seen that, for the manually extracted feature set, SVM outperformed both Random Forest and Naive Bayes. However, Random Forest worked somewhat better than SVM with the automatic feature set. Hence, while Naive Bayes performed clearly the worst for both feature sets, we cannot draw a definite conclusion as regards which machine learning algorithm, SVM or Random Forest, is more suitable for the given classification task. Comparing the two feature sets, the best accuracy scores attained (with Random Forest for the automatic features, and with SVM for the manual features) are identical (71.4%), and the F-score with the automatic features is slightly better (76.0% vs. 75.0%). The fact that the F-scores and accuracy scores achieved with the automatically extracted feature set are competitive with the scores of the manually calculated features demonstrates that our approach of using ASR techniques for feature extraction is viable. Comparing the precision and recall values, the Random Forest method shows a clear preference for the automatic feature set, as the recall values are the same, while the precision is higher. The case of SVM is not that clear, as it gives a higher recall for the automatic feature set, and a higher precision for the manual set. In this case the ROC curve is worth examining, as it allows the evaluation of a classifier at various true and false positive rates. Figs. (1a, 1b, 1c) show the ROC curves of the three classifiers. In the case of the Naive Bayes classifier, the automatic feature set is worse than the manual one in almost all cases, and this fact is also clearly reflected by the corresponding AUC value in Table . 
However, none of the curves have a clear dominance in the case of the SVM and the Random Forest classifiers, and the best AUC values are also very close (70.8% for the manual and 69.9% for the automatic feature set).

Fig. (1)

(a). The ROC curve for the Naive Bayes classifier. (b). The ROC curve for the SVM classifier. (c). The ROC curve for the Random Forest classifier.

Lastly, we performed an experiment where we removed those features that were found not to be significant in Section 3.1. This way, from the 54 features only 26 were retained (cf. Table ). The machine learning methods were applied the same way as before. Table shows the classification accuracy values obtained using only the statistically significant features. In the case of the SVM classifier, we obtained slightly worse results compared to using all the features, which, in our opinion, reflects the fact that SVM could make use of even those features which did not display a statistically significant difference between the two groups. On the other hand, the performance of the Random Forest classifier slightly improved with the reduction of the feature set. In summary, the best F-score we could achieve is 78.8% with the automatic features (reduced feature set, Random Forest classifier), and 75.0% for the manual features (full feature set, SVM classifier). The achieved F-score of 78.8% is far better than what random guessing would give, and we hope that with future refinements our approach will serve as the starting point for developing a fully automatic MCI screening software package.

CONCLUSION

Here, we performed a statistical analysis of a previously proposed acoustic feature set on the spontaneous speech of MCI patients. The analysis indicated that the speech rate, the number and duration of silent and filled pauses, and some other derived features differ significantly between MCI patients and control subjects, and hence these features can be used as acoustic biomarkers to strengthen the diagnosis of MCI. We also went one step further and introduced a speech recognition-based method for the automatic extraction of these features. Lastly, we attempted to automate the diagnosis as well, using machine learning methods. In our experiments, we were able to separate the MCI patients from the control group with an F-score of 78.8% using only automatically extracted features and automatic classification.

Disclosure

Part of this article has been previously published at the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany, pp. 2694-98.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE

All the tests were carried out at the Alzheimer Disease Research Centre of the University of Szeged, Hungary. The research was approved by the Ethics Committee of the University of Szeged, and all experiments were conducted in accordance with the Declaration of Helsinki.

HUMAN AND ANIMAL RIGHTS

No animals were used in this research. All human research procedures followed were in accordance with the standards set forth in the Declaration of Helsinki principles of 1975, as revised in 2008 (http://www.wma.net/en/20activities/10ethics/10helsinki/).

CONSENT FOR PUBLICATION

Consent was obtained from all participants.

Table 1a

The main statistics of the MCI and the control groups - personal data.

	Age			Years of Education			Sex
	Mean	SD	Range	Mean	SD	Range	Male	Female	Total
MCI	73.08	7.95	55 - 93	11.82	3.29	8 - 18	16	32	48
Control	64.13	7.08	57 - 84	12.47	3.21	8 - 20	13	23	36

Table 1b

The main statistics of the MCI and the control groups - mental test results.

	MMSE Score			ADASCog Score			Clock Drawing Score
	Mean	SD	Range	Mean	SD	Range	Mean	SD	Range
MCI	26.97	0.96	25 - 28	11.97	3.15	6.3 - 16.6	6.91	3.17	0 - 10
Control	29.17	0.71	28 - 30	8.25	2.19	6.0 - 16.6	9.11	1.75	2 - 10

Table 2

The proposed acoustic features.

Name	Description
Duration	The total duration of the utterance (ms)
Speech rate	The number of phonemes per second during speech (including hesitations); the number of total phonemes uttered, divided by the total duration of the utterance
Articulation rate	The number of phonemes per second during speech (excluding hesitations)
Number of pauses	The number of pause occurrences
Total length of pauses	The total duration of pauses (ms)
Total length of pauses / Duration	The ratio of total pause duration and the length of the utterance (%)
Pause rate	The number of pause occurrences divided by the total duration of the utterance
Average length of pauses	The total duration of pauses divided by the number of pauses
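
The definitions in Table 2 can be sketched as a short computation over a time-aligned segmentation of one utterance. The `Segment` structure, the segment labels and the example values are hypothetical. The sketch also assumes that pauses carry no phonemes, that the articulation rate divides by the duration minus the total pause time, and that the pause rate is expressed per second; none of these details is specified in the table.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str          # "speech", "silent_pause" or "filled_pause" (hypothetical labels)
    start_ms: float
    end_ms: float
    phonemes: int = 0  # phoneme count; assumed to be 0 for pauses


def acoustic_features(segments):
    """Compute the Table 2 features for one utterance."""
    duration_ms = segments[-1].end_ms - segments[0].start_ms
    pauses = [s for s in segments if s.kind != "speech"]
    pause_ms = sum(s.end_ms - s.start_ms for s in pauses)
    phonemes = sum(s.phonemes for s in segments if s.kind == "speech")
    n_pauses = len(pauses)
    return {
        "duration_ms": duration_ms,
        # Phonemes per second over the whole utterance (hesitations included).
        "speech_rate": phonemes / (duration_ms / 1000.0),
        # Phonemes per second with pause time removed (assumed definition).
        "articulation_rate": phonemes / ((duration_ms - pause_ms) / 1000.0),
        "n_pauses": n_pauses,
        "pause_ms": pause_ms,
        "pause_ratio_pct": 100.0 * pause_ms / duration_ms,
        # Pause occurrences per second (the unit is an assumption).
        "pause_rate": n_pauses / (duration_ms / 1000.0),
        "avg_pause_ms": pause_ms / n_pauses if n_pauses else 0.0,
    }


# Hypothetical 5-second utterance with one silent and one filled pause.
segments = [
    Segment("speech", 0, 2000, phonemes=20),
    Segment("silent_pause", 2000, 2500),
    Segment("speech", 2500, 4000, phonemes=16),
    Segment("filled_pause", 4000, 4500),
    Segment("speech", 4500, 5000, phonemes=4),
]
features = acoustic_features(segments)
for name, value in features.items():
    print(f"{name}: {value}")
```

Restricting `pauses` to one pause type at a time would yield the separate silent-pause and filled-pause variants listed in Table 3.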

Table 3

The significance (p-value) of each feature in the three tasks.

	Task 1	Task 2	Task 3
Duration	0.0369	0.0005	0.0218
Speech rate	0.0074	0.1346	0.0002
Articulation rate	0.0138	0.1073	0.0023
No. of silent pauses	0.0389	0.0018	0.1131
No. of filled pauses	0.0739	0.0011	0.0989
No. of pauses	0.0290	0.0008	0.0768
Total length of silent pauses	0.0369	0.0037	0.0068
Total length of filled pauses	0.0588	0.0011	0.0375
Length of pauses	0.0214	0.0014	0.0034
Silent pause / duration	0.0672	0.3850	0.0124
Filled pause / duration	0.0945	0.0398	0.1244
Pause / duration	0.0265	0.2294	0.0037
No. of silent pauses / duration	0.4871	0.1607	0.2591
No. of filled pauses / duration	0.1664	0.1160	0.3886
No. of pauses / duration	0.2375	0.3861	0.3404
Average length of silent pauses	0.0570	0.1247	0.0079
Average length of filled pauses	0.1034	0.1308	0.1749
Average length of pauses	0.0730	0.0913	0.0071
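
The reduction of the feature set can be reproduced from the values in Table 3. The sketch below assumes that the table entries are p-values and that a feature counts as significant when its value is below 0.05. The threshold is an assumption, but it matches the count of 26 retained features out of 54 reported in the text.

```python
ALPHA = 0.05  # assumed significance threshold

# Values copied from Table 3: feature -> (Task 1, Task 2, Task 3).
P_VALUES = {
    "Duration": (0.0369, 0.0005, 0.0218),
    "Speech rate": (0.0074, 0.1346, 0.0002),
    "Articulation rate": (0.0138, 0.1073, 0.0023),
    "No. of silent pauses": (0.0389, 0.0018, 0.1131),
    "No. of filled pauses": (0.0739, 0.0011, 0.0989),
    "No. of pauses": (0.0290, 0.0008, 0.0768),
    "Total length of silent pauses": (0.0369, 0.0037, 0.0068),
    "Total length of filled pauses": (0.0588, 0.0011, 0.0375),
    "Length of pauses": (0.0214, 0.0014, 0.0034),
    "Silent pause / duration": (0.0672, 0.3850, 0.0124),
    "Filled pause / duration": (0.0945, 0.0398, 0.1244),
    "Pause / duration": (0.0265, 0.2294, 0.0037),
    "No. of silent pauses / duration": (0.4871, 0.1607, 0.2591),
    "No. of filled pauses / duration": (0.1664, 0.1160, 0.3886),
    "No. of pauses / duration": (0.2375, 0.3861, 0.3404),
    "Average length of silent pauses": (0.0570, 0.1247, 0.0079),
    "Average length of filled pauses": (0.1034, 0.1308, 0.1749),
    "Average length of pauses": (0.0730, 0.0913, 0.0071),
}

# Each (feature, task) pair is one input feature for the classifiers.
all_features = [
    (name, task)
    for name, values in P_VALUES.items()
    for task in range(1, len(values) + 1)
]
selected = [
    (name, task) for name, task in all_features
    if P_VALUES[name][task - 1] < ALPHA
]

print(f"{len(selected)} of {len(all_features)} features retained")
for task in (1, 2, 3):
    kept = sum(1 for _, t in selected if t == task)
    print(f"Task {task}: {kept} significant features")
```

With this threshold, 8, 8 and 10 features are retained for Tasks 1, 2 and 3, respectively.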

Table 4

The accuracy scores using all the features.

Method	Feature Set	Acc.	Prec.	Recall (Sens.)	Spec.	F₁	AUC
Naive Bayes	Manual	61.9%	72.2%	54.2%	72.2%	61.9%	70.8%
Naive Bayes	Automatic	58.3%	71.0%	45.8%	75.0%	55.7%	62.9%
Random Forest	Manual	67.9%	69.1%	79.2%	52.8%	73.8%	68.2%
Random Forest	Automatic	71.4%	73.1%	79.2%	61.1%	76.0%	69.9%
SVM	Manual	71.4%	75.0%	75.0%	66.7%	75.0%	70.8%
SVM	Automatic	64.3%	66.1%	77.1%	47.2%	71.2%	62.2%
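
The measures in Table 4 all derive from the four counts of a confusion matrix. The sketch below shows the relationships for the Random Forest classifier with the automatic feature set. The counts are not reported in the article: they are inferred here from the recall and specificity values together with the group sizes in Table 1a (48 MCI, 36 control), with MCI as the positive class, and should be read as an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Inferred counts (assumption): 38 of 48 MCI subjects and 22 of 36 controls
# classified correctly.
metrics = classification_metrics(tp=38, fn=10, tn=22, fp=14)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these counts the function reproduces the Random Forest, Automatic row of Table 4: 71.4% accuracy, 73.1% precision, 79.2% recall, 61.1% specificity and an F1 of 76.0%.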

Table 5

The accuracy scores using only the significant features.

Method	Feature Set	Acc.	Prec.	Recall (Sens.)	Spec.	F₁	AUC
Naive Bayes	Manual	66.7%	79.4%	56.3%	80.3%	65.9%	73.0%
Naive Bayes	Automatic	57.1%	68.8%	45.8%	72.2%	55.0%	61.3%
Random Forest	Manual	69.1%	71.2%	77.1%	58.3%	74.0%	73.4%
Random Forest	Automatic	75.0%	76.5%	81.3%	66.7%	78.8%	67.6%
SVM	Manual	65.5%	67.9%	75.0%	52.8%	71.3%	73.4%
SVM	Automatic	64.3%	69.6%	66.7%	61.1%	68.1%	63.9%
