| Literature DB >> 33584015 |
Yongming Li, Xinyue Zhang, Pin Wang, Xiaoheng Zhang, Yuchuan Liu.
Abstract
Speech-based diagnosis of Parkinson's disease (PD) is particularly worth exploring as a non-invasive and simple diagnostic method. However, the number of PD speech samples is relatively small, and the data distribution differs between subjects. To address these two problems, a novel unsupervised two-step sparse transfer learning method is proposed in this paper to tackle PD speech diagnosis. In the first step, convolution sparse coding with coordinate selection of samples and features is designed to learn speech structure from the source domain and replenish the sample information of the target domain. In the second step, joint local structure distribution alignment is designed to maintain the neighbor relationship between the respective samples of the training and test sets while reducing the distribution difference between the two domains. Two representative public PD speech datasets and one real-world PD speech dataset were used to verify the proposed method on PD speech diagnosis. Experimental results demonstrate that each step of the proposed method has a positive effect on the PD speech classification results, and that the method delivers superior performance over existing related methods.
Keywords: Convolution sparse coding; Domain adaptation; Parkinson’s disease; Speech diagnosis; Two-step sparse transfer learning
Year: 2021 PMID: 33584015 PMCID: PMC7871026 DOI: 10.1007/s00521-021-05741-0
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.606
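The first step of the proposed method builds on convolution sparse coding. As a minimal, hedged sketch of the underlying sparse-coding idea (plain, non-convolutional ISTA with a fixed random dictionary, not the authors' implementation), the following recovers a sparse code for a synthetic signal:

```python
import numpy as np

def soft_threshold(x, t):
    """Element-wise soft thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code(D, y, lam=0.05, n_iter=500):
    """ISTA: minimize 0.5 * ||y - D a||^2 + lam * ||a||_1 over the code a."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)         # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
a_true = np.zeros(128)
a_true[[3, 40, 99]] = [1.5, -2.0, 1.0]   # 3-sparse ground-truth code
y = D @ a_true                           # synthetic "speech feature" signal
a_hat = sparse_code(D, y)
```

After a few hundred iterations, `a_hat` is sparse and reconstructs `y` closely; the paper's step one applies this kind of representation convolutionally, with coordinate selection of samples and features, to transfer structure from the source domain.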
Fig. 1 Illustration of the proposed JLSDA method. a The data distribution of the original source domain (training set) and target domain (testing set); b the relationship between the samples and the domains after aligning only the source and target domain distributions; c the data distribution after aligning the source and target domain distributions while keeping the neighborhood structure relationship
Fig. 2 Confusion matrix for two-class diagnosis of PD
Fig. 3 a Sonograms of the source domain; b feature kernels extracted from the source domain; c original target domain; d target domain after the first-step transfer
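Fig. 2's two-class confusion matrix maps directly onto the three metrics reported in the tables below. A minimal sketch (the counts here are hypothetical, chosen only to reproduce the 92.5/95.0/90.0 pattern reported for FT&SVM (linear)):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted
#                 pred PD   pred healthy
cm = np.array([[19,        1],    # true PD
               [ 2,       18]])   # true healthy

tp, fn = cm[0, 0], cm[0, 1]
fp, tn = cm[1, 0], cm[1, 1]

acc = (tp + tn) / cm.sum()   # ACC: overall accuracy
tpr = tp / (tp + fn)         # TPR: sensitivity, PD subjects correctly flagged
tnr = tn / (tn + fp)         # TNR: specificity, healthy subjects correctly cleared
print(acc, tpr, tnr)         # 0.925 0.95 0.9
```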
First-step transfer classification accuracy for the Sakar dataset (LOSO)
| Method | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|
| KNN | 52.5 | 55.0 | 50.0 |
| SVM (linear) | 50.0 | 50.0 | 50.0 |
| FT&KNN | 90.0 | 85.0 | 95.0 |
| FT&SVM (linear) | 92.5 | 95.0 | 90.0 |
Second-step transfer classification accuracy for the Sakar dataset (LOSO)
| Method | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|
| KNN | 52.5 | 55.0 | 50.0 |
| SVM (linear) | 50.0 | 50.0 | 50.0 |
| ST&KNN | 67.5 | 65.0 | 70.0 |
| ST&SVM (linear) | 62.5 | 80.0 | 45.0 |
Two-step sparse transfer classification accuracy for the Sakar dataset (LOSO)
| Method | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|
| KNN | 52.5 | 55.0 | 50.0 |
| SVM (linear) | 50.0 | 50.0 | 50.0 |
| TSTL&KNN | 94.5 | 94.5 | 94.5 |
| TSTL&SVM (linear) | 97.5 | 97.5 | 97.5 |
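The "(LOSO)" annotation denotes leave-one-subject-out validation: every recording from one subject is held out per fold, so no subject appears in both training and test sets. A minimal sketch with a 1-NN classifier on synthetic data (the subject IDs, features, and class shift are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, recs_per_subject, n_feat = 10, 3, 5
subjects = np.repeat(np.arange(n_subjects), recs_per_subject)
labels = np.repeat(rng.integers(0, 2, n_subjects), recs_per_subject)
# Class-dependent mean shift so the synthetic problem is learnable
X = rng.standard_normal((len(labels), n_feat)) + 2.0 * labels[:, None]

def nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbour prediction by Euclidean distance."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[d.argmin(axis=1)]

correct = 0
for s in np.unique(subjects):
    test = subjects == s          # hold out every recording of subject s
    pred = nn_predict(X[~test], labels[~test], X[test])
    correct += (pred == labels[test]).sum()
acc = correct / len(labels)       # LOSO accuracy over all held-out recordings
```

Splitting by recording instead of by subject would leak subject identity into training and inflate accuracy, which is why the tables report LOSO throughout.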
Fig. 4 The t-SNE visualizations of TSTL on three PD speech datasets. a Non-TSTL on Sakar; b TSTL on Sakar; c non-TSTL on MaxLittle; d TSTL on MaxLittle; e non-TSTL on DNSH; f TSTL on DNSH
The comparison of UDA classification results of the proposed algorithm on three datasets (all methods evaluated with LOSO)
| Method | Dataset | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|---|
| TCA | Sakar | 55.00 | 65.00 | 45.00 |
| TCA | MaxLittle | 75.00 | 100.00 | 0.00 |
| TCA | DNSH | 46.88 | 59.38 | 34.38 |
| CORAL | Sakar | 52.50 | 50.00 | 55.00 |
| CORAL | MaxLittle | 75.00 | 100.00 | 0.00 |
| CORAL | DNSH | 48.44 | 56.52 | 40.63 |
| DAN | Sakar | 62.50 | 65.00 | 60.00 |
| DAN | MaxLittle | 66.88 | 84.17 | 15.00 |
| DAN | DNSH | 45.94 | 45.66 | 46.25 |
| DANN | Sakar | 54.25 | 54.50 | 54.00 |
| DANN | MaxLittle | 72.81 | 93.75 | 10.00 |
| DANN | DNSH | 47.67 | 54.06 | 41.25 |
| TSTL | Sakar | – | – | – |
| TSTL | MaxLittle | – | – | – |
| TSTL | DNSH | – | – | – |
ACC accuracy; TPR true positive rate; TNR true negative rate
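CORAL, one of the UDA baselines in the table above, aligns second-order statistics by whitening the source features and re-colouring them with the target covariance. A minimal sketch on synthetic data (not the implementation used in the paper's experiments):

```python
import numpy as np

def coral(Xs, Xt, eps=1e-3):
    """Align source features to target: Xs @ Cs^{-1/2} @ Ct^{1/2}."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def matrix_power(C, p):
        # Symmetric PSD matrix power via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** p) @ V.T

    return Xs @ matrix_power(Cs, -0.5) @ matrix_power(Ct, 0.5)

rng = np.random.default_rng(1)
# Source features with mismatched per-feature scales; target roughly isotropic
Xs = rng.standard_normal((200, 4)) * np.array([1.0, 2.0, 0.5, 3.0])
Xt = rng.standard_normal((300, 4))
Xs_aligned = coral(Xs, Xt)   # covariance of Xs_aligned now matches Xt's
```

This only matches covariances; the paper's joint local structure distribution alignment additionally preserves neighborhood relations between samples, which CORAL ignores.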
Fig. 5 Relationship between the number of convolution kernels and classification accuracy for the Sakar dataset
Fig. 6 Relationship between the number of neighbor samples and classification accuracy on the Sakar dataset
The comparison of classification results of the proposed algorithm with prior studies on the Sakar dataset
| Study | Method | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|---|
| Canturk and Karabiber [ | 4 feature selection methods & 6 classifiers | 57.50 | 54.28 | 80.00 |
| Eskidere et al. [ | Random subspace classifier ensemble | 74.17 | – | – |
| Zhang et al. [ | MENN&RF | 81.50 | 92.50 | 70.50 |
| Benba et al. [ | HFCC + SVM | 87.50 | 90.00 | 85.00 |
| Li et al. [ | Hybrid feature learning&SVM | 82.50 | 85.00 | 80.00 |
| Vadovský and Paralič [ | C4.5&C5.0&RF&CART | 66.50 | – | – |
| Zhang [ | LSVM&MSVM&RSVM&CART&KNN&LDA&NB | 94.17 | 50.00 | 94.92 |
| Benba et al. [ | MFCC&SVM | 82.50 | 80.00 | 85.00 |
| Kraipeerapun and Amornsamankul [ | Stacking&CMTNN | 75.00 | – | – |
| Khan et al. [ | Evolutionary neural network ensembles | 90.00 | 93.00 | 97.00 |
| Ali et al. [ | LDA-NN-GA | 95.00 | 95.00 | 95.00 |
| – | DBN | 54.60 | 52.40 | 56.80 |
| – | CNN | 60.00 | 63.00 | 57.00 |
| – | DBN&SVM | 50.50 | 53.00 | 48.00 |
| – | Autoencoder&SVM | 67.50 | 65.00 | 70.00 |
| Proposed algorithm | TSTL&SVM | – | – | – |
ACC accuracy; TPR true positive rate; TNR true negative rate
The comparison of classification results of the proposed algorithm with prior studies on the MaxLittle dataset
| Study | Method | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|---|
| Little et al. [ | Preselection filter + exhaustive search + SVM | 91.40 | – | – |
| Shahbaba and Neal [ | Dirichlet process mixtures | 87.70 | – | – |
| Psorakis et al. [ | mRVMs | 89.47 | – | – |
| Guo et al. [ | GA-EM | 93.10 | – | – |
| Sakar and Kursun [ | Mutual information + SVM | 92.75 | – | – |
| Das [ | ANN decision tree | 92.90 | – | – |
| Ozcift and Gulten [ | Correlation-based feature selection-rotation forest | 87.10 | – | – |
| Luukka [ | Fuzzy entropy measures + similarity | 85.03 | – | – |
| Li et al. [ | Fuzzy-based nonlinear transformation + SVM | 93.47 | – | – |
| Spadoto et al. [ | PSO + OPF, harmony search + OPF, gravitational search + OPF | 84.01 | – | – |
| Polat [ | FCMFW + KNN | 97.93 | – | – |
| Chen et al. [ | PCA-fuzzy KNN | 96.07 | – | – |
| Ali et al. [ | DBN | 94.00 | – | – |
| Åström and Koker [ | Parallel ANN | 91.20 | 90.50 | 93.00 |
| Daliri [ | SVM with Chi-square distance kernel | 91.20 | 91.71 | 89.92 |
| Zuo et al. [ | PSO-fuzzy KNN | 97.47 | 98.16 | 96.57 |
| Kadam and Jadhav [ | FESA-DNN | 93.84 | 95.23 | 90.00 |
| Ma et al. [ | SVM-RFE | 96.29 | 95.00 | 97.50 |
| Cai et al. [ | RF-BFO-SVM | 97.42 | 99.29 | 91.50 |
| Dash et al. [ | ECFA-SVM | 97.95 | 97.90 | – |
| Gürüler [ | KMCFW-CVANN | 99.52 | 100.00 | 99.47 |
| – | SVM (linear kernel) | 75.00 | 100.00 | 0.00 |
| – | SVM (RBF kernel) | 75.00 | 100.00 | 0.00 |
| Proposed algorithm | TSTL&SVM | 96.87 | – | 87.50 |
ACC accuracy; TPR true positive rate; TNR true negative rate
The comparison of classification results of the proposed algorithm on the DNSH dataset (LOSO)
| Study | Method | ACC (%) | TPR (%) | TNR (%) |
|---|---|---|---|---|
| – | KNN | 52.5 | 55.0 | 50.0 |
| – | SVM (linear kernel) | 50.0 | 50.0 | 50.0 |
| Proposed algorithm | TSTL&SVM | 90.63 | 90.63 | 90.63 |
The time cost of the proposed algorithm on PD speech datasets
| Dataset | Sakar | MaxLittle | DNSH |
|---|---|---|---|
| Time cost (s) | 25.188 | 3.269 | 18.133 |
Fig. 7 Time cost for different sample sizes on the Sakar dataset