Literature DB >> 30938684

A Stroke Risk Detection: Improving Hybrid Feature Selection Method.

Yonglai Zhang1, Yaojian Zhou1, Dongsong Zhang2, Wenai Song1.   

Abstract

BACKGROUND: Stroke is one of the most common diseases that cause mortality. Detecting the risk of stroke for individuals is critical yet challenging because of a large number of risk factors for stroke.
OBJECTIVE: This study aimed to address the limitation of ineffective feature selection in existing research on stroke risk detection. We have proposed a new feature selection method called weighting- and ranking-based hybrid feature selection (WRHFS) to select important risk factors for detecting ischemic stroke.
METHODS: WRHFS integrates the strengths of various filter algorithms by following the principle of a wrapper approach. We employed a variety of filter-based feature selection models as the candidate set, including standard deviation, Pearson correlation coefficient, Fisher score, information gain, Relief algorithm, and chi-square test and used sensitivity, specificity, accuracy, and Youden index as performance metrics to evaluate the proposed method.
RESULTS: This study chose 792 samples from the electronic records of 13,421 patients in a community hospital. Each sample included 28 features (24 blood test features and 4 demographic features). The results of evaluation showed that the proposed method selected 9 important features out of the original 28 features and significantly outperformed baseline methods. Their cumulative contribution was 0.51. The WRHFS method achieved a sensitivity of 82.7% (329/398), specificity of 80.4% (317/394), classification accuracy of 81.5% (645/792), and Youden index of 0.63 using only the top 9 features. We have also presented a chart for visualizing the risk of having ischemic strokes.
CONCLUSIONS: This study has proposed, developed, and evaluated a new feature selection method for identifying the most important features for building effective and parsimonious models for stroke risk detection. The findings of this research provide several novel research contributions and practical implications. ©Yonglai Zhang, Yaojian Zhou, Dongsong Zhang, Wenai Song. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 02.04.2019.

Entities:  

Keywords:  WRHFS; feature selection; machine learning; risk; stroke

Mesh:

Year:  2019        PMID: 30938684      PMCID: PMC6466481          DOI: 10.2196/12437

Source DB:  PubMed          Journal:  J Med Internet Res        ISSN: 1438-8871            Impact factor:   5.428


Introduction

Background and Research Objective

Stroke is the second most common cardiovascular disease (CVD). The World Health Organization estimated that 17.7 million people died from CVDs in 2017, of which 6.7 million deaths were due to stroke, representing 31% of all deaths caused by diseases in the world [1]. The epidemiological characteristics of stroke in developing countries have gradually become closer to those of developed countries [2]. The prevalence and mortality of stroke are still on the rise. As of 2016, there were 13 million people with stroke in China [3]. Stroke prevention was the theme set by the World Stroke Organization for the 2017 World Stroke Day. Therefore, timely detection and prevention of stroke are essential. People may go to a hospital for a full physical examination to assess stroke risk. Specific examination items include blood biochemical tests, blood pressure, electrocardiogram, vascular ultrasound, vascular computerized tomography angiography, magnetic resonance angiography, electroencephalography, magnetoencephalography, single photon emission computerized tomography, positron emission computerized tomography, magnetic resonance imaging, and digital subtraction angiography. Plaque image analysis based on image segmentation technology has also been explored for risk detection of strokes [4,5]. As traditional medical risk assessment is expensive and not scalable, automated detection of stroke risk has been increasingly studied in recent years (eg, [6-11]); this work falls into 2 broad categories: stroke risk assessment modeling and brain image analysis. Many countries have employed automated detection models for stroke, such as systematic coronary risk evaluation [12], QRISK (QRESEARCH cardiovascular risk algorithm) [13], and Reynolds risk score [14]. The Framingham risk assessment model is a typical risk detection model for stroke. Pencina et al used an extended Framingham model to develop a 30-year risk detection model with data collected from 4506 patients aged 20 to 59 years [15]. 
The model detected the 30-year risk using 8 risk factors: gender, antihypertensives, blood pressure, total cholesterol, high-density lipoprotein (HDL), smoking, impaired glucose tolerance, and left ventricular hypertrophy. Flueckiger et al further extended the Framingham model to establish a score detection model of stroke in a multiethnic study of atherosclerosis in conjunction with nontraditional risk markers [16]. Their detection model included demographics, medical history, anthropometrics, and conventional risk factors. However, the Framingham model overestimates the risk of stroke in China because of obvious differences in the disease spectrum and risk factors [17,18]. A joint Chinese-American research group constructed a risk detection model of ischemic stroke and hemorrhagic stroke with 6 risk factors: systolic blood pressure, sex, age, total cholesterol (TC), diabetes, and smoking [19]. Stroke consists of ischemic stroke and hemorrhagic stroke. Ischemic stroke accounts for 60% to 80% of stroke occurrence in China and is the main context of this study. The detection of risk for ischemic stroke aims to reduce or prevent the incidence of clinical events and premature death associated with ischemic stroke through early prevention. A key limitation of existing research on stroke risk assessment lies in the lack of systematic guidance for feature selection while building stroke risk detection models, which is essential to the performance of such models. Previous studies chose predictive features largely in an ad hoc manner and did not incorporate the latest results of medical research. Thus, the core research question of this study is how to select the important risk factors that should be included as predictive features in a risk detection model for ischemic stroke. To address this research question, we proposed, developed, and evaluated a new hybrid feature selection method, namely weighting- and ranking-based hybrid feature selection (WRHFS). 
WRHFS integrates the strengths of various filter algorithms and deploys continuous weighting and ranking of individual features by following the principle of a wrapper approach. It then selects the top N ranked features as the most important features. This study makes a significant research contribution by proposing a new methodological approach to feature selection, which can lead to improved performance of risk detection models.

Related Work

The key to accurate stroke risk detection is to select the most important and influential features of stroke patients, which may vary among patients in different regions. Past research has shown that stroke is significantly associated with age [20], gender [21], blood pressure [20,21], low-density lipoprotein [22], triglyceride [23], drinking [24], smoking [25], creatine kinase (CK) [25], height [26], TC [27], HDL [24,27], body mass index (BMI) [22,25,28], serum total cholesterol [22,29], smoking [22,24,30], and diabetes [22,31]. Recently, some new risk factors have been discovered by medical research. For example, alkaline phosphatase [32] and hypercholesterolemia [33] have been found to increase the probability of mortality in stroke patients. Studies have also shown that there is a clear epidemiological relationship between stroke risk and hyperlipidemia [34]. However, no single study has used all features that are theoretically related to stroke because of limited availability of data. Traditionally, predictors of stroke risk were identified based on the findings of medical research and practice. However, collecting data for risk factors (also referred to as features in this paper) based on the results of medical research is extremely difficult. In the past decade, there has been increasing research on building automated stroke risk detection models by leveraging machine-learning techniques and patient data. One of the essential steps in building such models is to select effective features (ie, influential factors) that are associated with stroke, which is often referred to as the feature selection process. We categorized feature selection methods used in automated stroke risk detection models into semisupervised, unsupervised, and supervised methods [35,36], as summarized in Table 1. Semisupervised feature selection methods are suitable for datasets with a small number of labeled samples and a large number of unlabeled samples [37]. 
The key challenge lies in how to use the labeled samples to efficiently process the unlabeled samples. At present, unsupervised feature selection methods mainly focus on clustering-based models, for example, Laplacian score [38], trace ratio [39], and sparsity regularization–based models [40]. For example, a coregularized unsupervised feature selection algorithm was proposed in a study by Zhu et al [41], which was intended to ensure that the selected features could preserve both data distribution and reconstruction.
Table 1

Classification of feature selection methods.

Methods | Rationale | Limitations | Sample studies
Supervised: filter | Mutual information based | Single objective function | [42]
Supervised: filter | Ranking based | Neglecting the correlation between the features and class labels | [43]
Supervised: filter | Weighting based | Lacking uniform standards of selecting features | [44]
Supervised: wrapper | Evaluating the accuracy of the classifier | Overfitting and high computational complexity | [45]
Supervised: hybrid | Guiding the wrapper using a filter | Only for certain specific fields | [46]
Semisupervised | Guiding by the labeled samples | Relying on small labeled samples | [37]
Unsupervised | Clustering-based models | Relying on certain data distribution | [40]
Supervised feature selection methods can be further divided into filter, wrapper, and hybrid methods. Filter feature selection methods consist of mutual information–based and ranking- and weighting-based methods. Mutual information–based filter methods use mutual information to evaluate the relevance of features to class labels and the redundancy of candidate features. However, they suffer from the problem that the objective function only uses a single statistic measure of a dataset (eg, standard deviation, information gain [42,47], or Fisher score [48]), while ignoring the fusion of multiple measures. For example, a standard deviation–based filter model relies on the distance between a feature value and the mean value for feature selection. Information entropy is often used to measure the uncertainty of the value of a random variable. Information gain, defined as the change in information entropy, of a feature in a dataset can be used to rank features: the greater the information gain, the more a feature contributes to classification. Feature ranking methods (eg, the maximal relevance and minimal redundancy objective [43]) are independent of classification algorithms. They select a feature subset with metrics such as the Relief algorithm [49-51] and correlation estimates [43,52]. The Relief algorithm has been successfully applied to feature weighting because of its simplicity and effectiveness [41,42,47]. It is inspired by instance-based learning algorithms and their ability to discriminate neighboring patterns. With linear time complexity, Relief has a great advantage in computational efficiency. It selects a sample x randomly and then finds the nearest neighbor sample NearHit(x) in the same class and the nearest neighbor sample NearMiss(x) in another class. 
However, its significant disadvantage is that feature ranking overemphasizes the relevance of a certain feature to a class label, or its correlation with other individual features, based on a single objective function, while neglecting the correlation between combined features and a class label. In addition, when the independent relevance of a feature is emphasized, the redundancy of the feature ranking increases, which contradicts the objective of minimizing redundancy. Feature weighting methods attempt to assign a weight value, usually in the range of 0 to 1, to each feature. Features with weights near 1 are selected to form a feature set, whereas other features are discarded [44]. These methods lack uniform standards for selecting features because of the fuzziness of “near 1.” Overall, filter models select features by weighting and ranking them based on their statistical relevance to class labels and then applying a threshold to filter out irrelevant features, thereby improving classification accuracy [53]. Wrapper methods search for the optimal subset of features in a feature space and use a classifier to evaluate the effectiveness of each feature subset. For a particular classifier, wrapper methods may find good feature subsets [45]. However, they are prone to overfitting and high computational complexity. Hybrid models use a filter model to guide a wrapper model to overcome these problems of filter and wrapper methods [46,54-56]. In summary, for stroke risk detection, traditional feature selection methods have a variety of limitations that negatively affect the quality of selected features and the performance of stroke risk detection models.
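As a concrete illustration of the Relief procedure described above, here is a minimal sketch of the two-class Relief weight update. The sampling count, random seeding, and the use of Euclidean distance with squared per-feature differences are assumptions for illustration, not the exact configuration used in the cited studies:

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Two-class Relief: for each sampled point, reward features that
    separate it from NearMiss (other class) and penalize features that
    separate it from NearHit (same class)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        x, label = X[i], y[i]
        same = np.flatnonzero(y == label)
        same = same[same != i]                 # exclude the sample itself
        diff = np.flatnonzero(y != label)
        near_hit = same[np.argmin(np.linalg.norm(X[same] - x, axis=1))]
        near_miss = diff[np.argmin(np.linalg.norm(X[diff] - x, axis=1))]
        # per-feature squared differences: miss adds to the weight, hit subtracts
        w += (X[near_miss] - x) ** 2 - (X[near_hit] - x) ** 2
    return w / n_iter

# Toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.0, 0.5], [0.1, 0.4], [1.0, 0.5], [0.9, 0.45]])
y = np.array([0, 0, 1, 1])
w = relief_weights(X, y, n_iter=50)
```

On this toy data, the discriminative feature receives a clearly larger weight than the noise feature, which is the property the filter exploits when ranking features.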

Methods

Design

In this study, we proposed a new hybrid feature selection model called WRHFS, which selects features by integrating various filter and wrapper methods. Different from previous hybrid methods, WRHFS selects the best n filter models (in this study, n=3) from a candidate set to guide a wrapper model. Figure 1 shows the process of WRHFS, which consists of 4 stages: filter, wrapper, voting, and assessment.
Figure 1

Weighting- and ranking-based hybrid feature selection.

Filter stage: ranking features with multiple filter models. (1) Randomly choose 3 different models from the set of candidate filter models. (2) Rank the features according to each filter model. WRHFS then uses the ordered features of each filter model to train multiple classification models based on the backward searching strategy and measures the classification accuracies and the contribution vectors ωi (i=1, 2, and 3) of the 3 filter models.
Wrapper stage: constructing an aggregated contribution vector W. (3) Create W by aggregating the 3 contribution vectors ωi generated by the 3 filter models, so that each element of W is the combined contribution of one feature.
Voting stage: voting features based on a contribution matrix. (4) Build the classification contribution matrix C from the 3 contribution vectors, with ω1, ω2, and ω3 as its columns. (5) Build the cumulative classification contribution matrix D from the vectors ωi, where D1, D2, and D3 are the cumulative contribution vectors derived from ω1, ω2, and ω3, respectively.
Assessment stage: assessing the effectiveness of the 3 filter feature selection models and selecting the most important features. (6) Build the effectiveness coefficient vector P, a 3-dimensional vector containing the effectiveness coefficients of the 3 filter models, and assess the models with it. WRHFS assesses the current 3 filter models according to P, replaces the worst filter model with one chosen from the remaining candidates, and repeats steps (1) to (6) until the candidate model set is empty. Finally, the top 3 filter models are chosen based on P to develop the optimal feature selection model.
(7) Calculate the weight Wr of each selected individual feature from its contributions under the 3 retained filter models. (8) Rank the features based on the weights Wr. (9) Select the top N key features based on their weights, such that the cumulative contribution of the key features exceeds 50%, and generate the risk index map for diseases using the surface fitting technique based on the key features.
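The aggregation and selection steps can be sketched as follows. This is a simplified illustration with hypothetical contribution vectors; the variable names mirror the paper's W, C, and D, but the numeric values and the exact published formulas are assumptions for demonstration:

```python
import numpy as np

# Hypothetical contribution vectors omega_i (one row per filter-guided model,
# one column per feature); the real vectors come from the backward search.
omega = np.array([
    [0.40, 0.10, 0.30, 0.05, 0.15],  # eg, standard deviation
    [0.35, 0.05, 0.40, 0.10, 0.10],  # eg, Relief
    [0.10, 0.50, 0.20, 0.15, 0.05],  # eg, information gain
])

W = omega.sum(axis=0)            # aggregated contribution vector (step 3)
C = omega.T                      # contribution matrix, features x models (step 4)
D = np.cumsum(omega, axis=1)     # cumulative contribution per model (step 5)

# Steps 7-9: rank features by aggregated weight and keep the smallest
# prefix whose cumulative contribution reaches 50% of the total.
order = np.argsort(W)[::-1]
cum = np.cumsum(W[order]) / W.sum()
top_n = int(np.searchsorted(cum, 0.5) + 1)
selected = order[:top_n]
```

With these toy vectors, features 2 and 0 carry the largest aggregated weights, so the 50% cutoff keeps just those two; on the real dataset the same rule retained the top 9 of 28 features.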

Performance Measures

We evaluated the performance of WRHFS using a real-world dataset, adopting the most common performance measures of classification models in medical diagnostics: sensitivity, specificity, accuracy, and Youden index. There are 4 categories of potential outcomes: true positive (TP; people with ischemic stroke risk correctly identified), false positive (FP; healthy people incorrectly identified as having risk), true negative (TN; healthy people correctly identified as healthy), and false negative (FN; people with ischemic stroke incorrectly identified as being without risk). Sensitivity (also called the true positive rate or recall) measures the proportion of actual positives correctly detected as people with stroke risk: sensitivity = TP / (TP + FN) (equation 6). Specificity (also called the true negative rate) measures the proportion of actual negatives correctly identified as healthy people: specificity = TN / (TN + FP) (equation 7). Accuracy is defined as accuracy = (TP + TN) / (TP + TN + FP + FN) (equation 8). Among the 3 measures, sensitivity is the most important medical criterion. The Youden index, also called the Youden J statistic, captures the performance of a dichotomous diagnostic test and is defined as Youden index = sensitivity + specificity − 1 (equation 9). We employed 6 filter methods commonly used in the medical field, based on standard deviation [57], Pearson correlation coefficient [58], Fisher score [59], information gain [60], Relief [61], and the chi-squared test [62]. We used 10-fold cross-validation to train and test the classification models. In the evaluation, we initially selected the methods based on standard deviation, Pearson correlation coefficient, and Fisher score. 
We adopted support vector machine (SVM), Bayes [63], classification based on associations [64], back-propagation neural networks [65], classification and regression tree [66], C4.5 (the decision tree learner) [67], and extreme learning machine [68] to build different detection models because they are commonly used classification algorithms. Afterward, we kept the top 3 filter methods as the benchmark feature selection models and compared their performances against that of the proposed WRHFS method.
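The four measures defined by equations (6) to (9) can be computed directly from confusion counts; applying them to the WRHFS counts reported later (329/398 positives and 317/394 negatives classified correctly) reproduces the reported sensitivity and Youden index:

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Equations (6)-(9): sensitivity, specificity, accuracy, Youden index."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    youden = sensitivity + specificity - 1
    return sensitivity, specificity, accuracy, youden

# WRHFS counts from Table 9: 329 of 398 positives, 317 of 394 negatives correct
sens, spec, _, j = diagnostic_metrics(tp=329, fn=398 - 329, tn=317, fp=394 - 317)
# sens = 329/398 ≈ 0.827 and j ≈ 0.63, matching the reported 82.7% and 0.63
```

Because sensitivity and specificity are computed on the positive and negative groups separately, they are insensitive to class imbalance, which is why the Youden index (their sum minus 1) is a common single-number summary for diagnostic tests.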

Results

Dataset

This study adopted a retrospective cohort design. We collected a dataset consisting of the records of 80,672 patients from a community hospital. Among them, 13,421 patients had suffered from ischemic stroke in the past 5 years. We extracted their records from before their diagnoses of ischemic stroke. Given the purpose of modeling, we chose and used only features that had no missing values in the entire dataset. We did not use missing-value imputation techniques because of concerns about possible biases or noise that such techniques may introduce. In the end, there were 792 complete records in the dataset, each including 24 blood test features, as shown in Table 2. We also included 4 demographic features of the patients: gender, age, height, and BMI. Descriptive statistics of age and gender are reported in Table 3. Among the 792 qualified patient records, 398 were diagnosed with ischemic stroke and labeled as class 1 instances, whereas the remaining 394 were not diagnosed with ischemic stroke and were labeled as class 2 instances.
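The complete-case filtering described above can be expressed in a few lines of pandas. This is a sketch on a hypothetical toy extract; the column names and values are illustrative, not the study's actual schema:

```python
import pandas as pd

# Hypothetical extract of the hospital records; None marks a missing value
records = pd.DataFrame({
    "age": [67, 54, None, 71],
    "CK": [120, None, 98, 143],
    "label": [1, 0, 1, 0],  # 1 = ischemic stroke, 0 = no ischemic stroke
})

# Complete-case selection: keep only records with no missing values,
# instead of imputing (to avoid introducing bias or noise)
complete = records.dropna()
```

Here 2 of the 4 toy records survive; the same rule reduced the study's 13,421 stroke patients' records to the 792 complete samples used for modeling.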
Table 2

24 blood test items.

Full name | Abbreviation | Unit | Type of data
α Hydroxybutyric dehydrogenase | α-HBD | IU/L | Integer
Gamma glutamyl transpeptidase | GGP | IU/L | Integer
Lactate dehydrogenase | LDH | mmol/L | Real
Low-density lipoprotein | LDL | mmol/L | Real
High-density lipoprotein | HDL | mmol/L | Real
Blood urea nitrogen | BUN | mmol/L | Real
Uric acid | UA | μmol/L | Integer
Total cholesterol | TC | mmol/L | Real
Total bilirubin | TBIL | μmol/L | Real
Total protein | TP | g/L | Integer
Triglyceride | TG | mmol/L | Real
Albumin | Alb | g/L | Integer
Direct bilirubin | DBIL | μmol/L | Real
Alkaline phosphatase | ALP | IU/L | Integer
Serum phosphorus | PI | mmol/L | Real
Serum creatinine | SCr | μmol/L | Integer
Creatine kinase | CK | IU/L | Integer
Creatine kinase isoenzyme | CK-MB | IU/L | Integer
Glucose | Glu | mmol/L | Real
Alanine aminotransferase | ALT | IU/L | Integer
Aspartate aminotransferase | AST | IU/L | Integer
Apolipoprotein A1 | Apo-A1 | g/L | Real
Apolipoprotein B | Apo-B | g/L | Real
Serum calcium | Ca | mmol/L | Real
Table 3

Descriptive statistics of age and gender of patients in the dataset (N=792)

Age (years) and gender | Statistics, n (%)
≥45 and ≤60, male | 105 (13.3)
≥45 and ≤60, female | 167 (21.1)
>60 and ≤75, male | 151 (19.1)
>60 and ≤75, female | 246 (31.1)
>75 and ≤90, male | 76 (9.6)
>75 and ≤90, female | 47 (5.9)

Weighting of Features Using Weighting- and Ranking-Based Hybrid Feature Selection

WRHFS assessed the effectiveness of the filter feature selection methods on the dataset. Afterward, we discarded the filter methods with lower effectiveness coefficients; the greater the coefficient, the higher the effectiveness. As shown in Table 4, information gain, Relief, and standard deviation led to the top 3 model performances. Table 5 shows the weights of the features based on standard deviation. The weights of the features based on Relief and information gain are shown in Multimedia Appendices 1 and 2. Here, “accuracy” refers to the classification accuracy of the SVM classifier under the backward searching strategy, whereas “C” and “q” indicate the optimal penalty parameter and the kernel bandwidth of the SVM algorithm, respectively. Feature contribution is the difference between the accuracy of a model including a specific feature and the accuracy of the same model without it. We normalized the results to between 0 and 1 to eliminate the difference between positive and negative values. “SD (0-1)” and “Contribution (0-1)” indicate the normalized results of the standard deviation and the contribution, respectively. “Weight” reflects the overall performance of a feature and is the sum of SD (0-1) and Contribution (0-1).
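The contribution computation described here can be sketched with scikit-learn. This is a hedged stand-in: the data are synthetic, the SVC hyperparameters are placeholders for the tuned “C” and “q”, and a leave-one-out variant is used in place of the paper's full backward searching strategy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the 792-record dataset (the real features are
# the blood test and demographic variables)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

def cv_accuracy(cols):
    # C and gamma play the roles of the penalty parameter "C" and the
    # kernel bandwidth "q"; the values here are placeholders
    clf = SVC(C=64, gamma=0.5)
    return cross_val_score(clf, X[:, cols], y, cv=10).mean()

all_cols = list(range(X.shape[1]))
base = cv_accuracy(all_cols)
# Contribution of feature j: accuracy with j minus accuracy without j
contrib = np.array([base - cv_accuracy([c for c in all_cols if c != j])
                    for j in all_cols])

# Min-max normalization to [0, 1] removes the positive/negative difference
span = contrib.max() - contrib.min()
contrib01 = (contrib - contrib.min()) / (span if span else 1.0)
```

The normalization step mirrors the “Contribution (0-1)” column: a feature whose removal hurts accuracy the most maps to 1, and one whose removal helps the most maps to 0.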
Table 4

Effectiveness coefficients of the filter feature selection methods.

Method | Effectiveness coefficient
Information gain | 63
Relief | 61
Standard deviation | 52
Pearson correlation coefficient | 49
Fisher score | 46
Chi-squared test | 40
Table 5

Weighting of the 28 features based on standard deviation.

Featurea | Standard deviation | C | q | Accuracy (%) | Contribution | SD (0-1) | Contribution (0-1) | Weight
CK | 0.21 | 16 | 0.5 | 56.1b | — | 1.00 | — | 1.00
LDH | 0.21 | 64 | 0.5 | 57.1 | 1.00 | 0.99 | 0.30 | 1.00
α-HBD | 0.19 | 128 | 8.0 | 58.6 | 1.52 | 0.91 | 0.37 | 1.52
Height | 0.17 | 8 | 8.0 | 57.3 | −1.26 | 0.81 | 0.00 | −1.26
ALP | 0.15 | 2 | 4.0 | 58.8 | 1.52 | 0.72 | 0.37 | 1.52
UA | 0.10 | 1 | 8.0 | 58.6 | −0.25 | 0.48 | 0.13 | −0.25
SCr | 0.09 | 16 | 4.0 | 61.9 | 3.28 | 0.41 | 0.60 | 3.28
GGP | 0.08 | 16 | 2.0 | 61.6 | −0.25 | 0.40 | 0.13 | −0.25
TP | 0.08 | 2 | 8.0 | 61.6 | 0.00 | 0.37 | 0.17 | 0.00
AGE | 0.08 | 64 | 1.0 | 67.9 | 6.31 | 0.36 | 1.00 | 6.31
ALT | 0.07 | 128 | 1.0 | 67.6 | −0.38 | 0.31 | 0.12 | −0.38
AST | 0.06 | 64 | 1.0 | 67.9 | 0.38 | 0.27 | 0.22 | 0.38
CK-MB | 0.05 | 128 | 0.2 | 69.2 | 1.26 | 0.23 | 0.33 | 1.26
Alb | 0.05 | 128 | 0.1 | 69.6 | 0.38 | 0.22 | 0.22 | 0.38
TBIL | 0.04 | 256 | 0.5 | 72.1 | 2.53 | 0.16 | 0.50 | 2.53
BMI | 0.03 | 64 | 0.3 | 72.7 | 0.63 | 0.12 | 0.25 | 0.63
Glu | 0.01 | 128 | 0.3 | 72.7 | 0.00 | 0.04 | 0.17 | 0.00
DBIL | 0.01 | 64 | 0.5 | 73.1 | 0.38 | 0.04 | 0.22 | 0.38
BUN | 0.01 | 64 | 0.5 | 73.0 | −0.13 | 0.03 | 0.15 | −0.13
TC | 0.01 | 64 | 0.5 | 73.2 | 0.25 | 0.02 | 0.20 | 0.25
LDL | 0.01 | 128 | 1.0 | 73.0 | −0.25 | 0.02 | 0.13 | −0.25
TG | 0.00 | 128 | 1.0 | 72.9 | −0.13 | 0.02 | 0.15 | −0.13
Gender | 0.00 | 128 | 1.0 | 73.0 | 0.13 | 0.00 | 0.18 | 0.13
Ca | 0.00 | 64 | 0.5 | 73.0 | 0.00 | 0.00 | 0.17 | 0.00
Apo-A1 | 0.00 | 128 | 1.0 | 73.2 | 0.25 | 0.00 | 0.20 | 0.25
HDL | 0.00 | 128 | 1.0 | 73.1 | −0.13 | 0.00 | 0.15 | −0.13
Apo-B | 0.00 | 128 | 1.0 | 73.1 | 0.00 | 0.00 | 0.17 | 0.00
PI | 0.00 | 128 | 1.0 | 73.0 | −0.13 | 0.00 | 0.15 | −0.13

aThe full forms of all abbreviations are shown in Table 2.

In Table 6, “weight sum” indicates the sum of the weights calculated by the 3 filter feature selection models. In Table 7, columns 2 to 4 compose the contribution matrix C, whereas columns 5 to 7 compose the cumulative contribution matrix D. In Table 8, the results of the weighted sum of the features using WRHFS are sorted in decreasing order, with a larger weight indicative of higher importance.
Table 6

Weighting of the 3 feature selection models.

Order | Featurea | Standard deviation | Relief | Information gain | Weight sum
1 | α-HBD | 0.9123 | 1.0000 | 0.0001 | 1.9124
2 | GGP | 0.4000 | 0.0657 | 0.0498 | 0.5156
3 | Alb | 0.2198 | 0.0592 | 0.0211 | 0.3001
4 | LDL | 0.0197 | 0.0026 | 0.0236 | 0.0459
5 | TG | 0.0156 | 0.0002 | 0.0001 | 0.0159
6 | HDL | 0.0032 | 0.0000 | 0.0010 | 0.0042
7 | ALT | 0.3120 | 0.0055 | 0.1141 | 0.4316
8 | AST | 0.2734 | 0.0366 | 0.0985 | 0.4085
9 | SCr | 0.4142 | 0.0637 | 0.0638 | 0.5417
10 | CK | 1.0000 | 0.5919 | 0.0549 | 1.6468
11 | CK-MB | 0.2303 | 0.0190 | 0.1657 | 0.4150
12 | ALP | 0.7239 | 0.0509 | 0.1051 | 0.8799
13 | AGE | 0.3574 | 0.0503 | 1.0000 | 1.4077
14 | BUN | 0.0296 | 0.0005 | 0.0845 | 0.1146
15 | UA | 0.4817 | 0.0024 | 0.0037 | 0.4878
16 | LDH | 0.9884 | 0.9582 | 0.0788 | 2.0254
17 | Height | 0.8145 | 0.4240 | 0.1235 | 1.3621
18 | BMI | 0.1171 | 0.0011 | 0.2146 | 0.3328
19 | Gender | 0.0049 | 0.0000 | 0.1349 | 0.1398
20 | Ca | 0.0040 | 0.0000 | 0.0000 | 0.0040
21 | PI | 0.0000 | 0.0001 | 0.0812 | 0.0813
22 | Glu | 0.0430 | 0.0009 | 0.4154 | 0.4593
23 | Apo-A1 | 0.0036 | 0.0001 | 0.4525 | 0.4562
24 | Apo-B | 0.0013 | 0.0001 | 0.6987 | 0.7000
25 | DBIL | 0.0364 | 0.0003 | 0.2629 | 0.2996
26 | TC | 0.0248 | 0.0000 | 0.0382 | 0.0630
27 | TBIL | 0.1633 | 0.0323 | 0.5188 | 0.7143
28 | TP | 0.3667 | 0.0946 | 0.0417 | 0.5029

aThe full forms of all abbreviations are shown in Table 2.

Table 7

Contribution of individual features.

Featurea | Contribution: standard deviation | Contribution: Relief | Contribution: information gain | Cumulative: standard deviation | Cumulative: Relief | Cumulative: information gain
α-HBD | 0.9123 | 1.0000 | 0.0001 | 1.6654 | 1.0000 | 6.2282
GGP | 0.4000 | 0.0657 | 0.0498 | 2.8987 | 2.5000 | 5.1229
Alb | 0.2198 | 0.0592 | 0.0211 | 4.9488 | 3.4428 | 5.8159
LDL | 0.0197 | 0.0026 | 0.0236 | 6.5655 | 6.2714 | 5.5703
TG | 0.0156 | 0.0002 | 0.0001 | 6.7155 | 7.8856 | 6.3685
HDL | 0.0032 | 0.0000 | 0.0010 | 7.4155 | 9.4571 | 6.0966
ALT | 0.3120 | 0.0055 | 0.1141 | 4.1821 | 6.0714 | 3.7369
AST | 0.2734 | 0.0366 | 0.0985 | 4.3988 | 5.0000 | 4.0439
SCr | 0.4142 | 0.0637 | 0.0638 | 2.7654 | 3.0857 | 4.8422
CK | 1.0000 | 0.5919 | 0.0549 | 1.0000 | 1.6571 | 4.9562
CK-MB | 0.2303 | 0.0190 | 0.1657 | 4.7321 | 5.8428 | 2.4474
ALP | 0.7239 | 0.0509 | 0.1051 | 2.0320 | 3.6428 | 3.8158
AGE | 0.3574 | 0.0503 | 1.0000 | 4.0654 | 4.6428 | 1.0000
BUN | 0.0296 | 0.0005 | 0.0845 | 6.2321 | 7.3999 | 4.1930
UA | 0.4817 | 0.0024 | 0.0037 | 2.1654 | 6.5428 | 5.9299
LDH | 0.9884 | 0.9582 | 0.0788 | 1.2987 | 1.6571 | 4.5878
Height | 0.8145 | 0.4240 | 0.1235 | 1.6654 | 1.7714 | 3.6492
BMI | 0.1171 | 0.0011 | 0.2146 | 5.6988 | 6.8857 | 2.3070
Gender | 0.0049 | 0.0000 | 0.1349 | 6.8988 | 9.2142 | 2.6492
Ca | 0.0040 | 0.0000 | 0.0000 | 7.0655 | 9.7142 | 6.5264
PI | 0.0000 | 0.0001 | 0.0812 | 7.7322 | 8.1285 | 4.3509
Glu | 0.0430 | 0.0009 | 0.4154 | 5.8655 | 7.1428 | 2.0614
Apo-A1 | 0.0036 | 0.0001 | 0.4525 | 7.2655 | 8.4142 | 1.8246
Apo-B | 0.0013 | 0.0001 | 0.6987 | 7.5822 | 8.7285 | 1.4386
DBIL | 0.0364 | 0.0003 | 0.2629 | 6.0821 | 7.6285 | 2.3070
TC | 0.0248 | 0.0000 | 0.0382 | 6.4322 | 8.9571 | 5.4387
TBIL | 0.1633 | 0.0323 | 0.5188 | 5.4488 | 5.3857 | 1.7018
TP | 0.3667 | 0.0946 | 0.0417 | 3.0654 | 2.0428 | 5.2808

aThe full forms of all abbreviations are shown in Table 2.

Table 8

Weighting of the 28 features using weighting- and ranking-based hybrid feature selection.

Order | Featurea | Weight | Contribution | Cumulative contribution | Weight (0-1)
1 | Age | 176.31 | 0.13 | 0.13 | 1.00
2 | α-HBD | 88.36 | 0.06 | 0.19 | 0.42
3 | SCr | 83.02 | 0.06 | 0.25 | 0.38
4 | LDH | 70.59 | 0.05 | 0.30 | 0.30
5 | Height | 70.32 | 0.05 | 0.35 | 0.30
6 | TBIL | 66.18 | 0.05 | 0.39 | 0.27
7 | CK | 59.22 | 0.04 | 0.44 | 0.22
8 | Apo-B | 55.61 | 0.04 | 0.48 | 0.20
9 | CK-MB | 54.09 | 0.04 | 0.51 | 0.19
10 | Alb | 48.60 | 0.03 | 0.55 | 0.15
11 | AST | 47.49 | 0.03 | 0.58 | 0.15
12 | GGP | 45.36 | 0.03 | 0.61 | 0.13
13 | DBIL | 40.76 | 0.03 | 0.64 | 0.10
14 | Glu | 39.35 | 0.03 | 0.67 | 0.09
15 | Gender | 37.99 | 0.03 | 0.70 | 0.08
16 | ALP | 36.26 | 0.03 | 0.72 | 0.07
17 | Apo-A1 | 35.60 | 0.03 | 0.75 | 0.07
18 | TP | 35.22 | 0.03 | 0.77 | 0.06
19 | Ca | 34.34 | 0.02 | 0.80 | 0.06
20 | TC | 34.34 | 0.02 | 0.82 | 0.06
21 | BMI | 33.90 | 0.02 | 0.85 | 0.06
22 | HDL | 33.16 | 0.02 | 0.87 | 0.05
23 | BUN | 32.92 | 0.02 | 0.89 | 0.05
24 | PI | 32.61 | 0.02 | 0.92 | 0.05
25 | TG | 32.37 | 0.02 | 0.94 | 0.05
26 | UA | 30.70 | 0.02 | 0.96 | 0.03
27 | LDL | 27.46 | 0.02 | 0.98 | 0.01
28 | ALT | 25.56 | 0.02 | 1.00 | 0.00

aThe full forms of all abbreviations are shown in Table 2.

Table 9 presents the optimal performance of the trained risk detection models in terms of the 4 measures: sensitivity (the positive detection rate), specificity (the negative detection rate), accuracy (the overall classification accuracy), and Youden index. Table 9 shows that the proposed WRHFS method achieved a sensitivity of 82.7% (329/398) and a classification accuracy of 81.5% (645/792) using only the top 9 features, and that different classification models achieved their best performance with different numbers of features. For example, information gain achieved its best classification accuracy of 72.5% (574/792) when using the top 10 features, and the accuracy began to decline when the eleventh feature was added. Similarly, standard deviation achieved its best classification accuracy of 73.2% (580/792) with the top 20 features presented in Table 5, and Relief achieved 72.9% (577/792) with the top 13 features. Therefore, we calculated the sensitivity, specificity, accuracy, and Youden index of those methods using only the optimal features that resulted in the best-performing models. Among these feature selection methods, the proposed WRHFS method produced the highest performance measures with the fewest features. As shown in Table 8, Age, α-HBD, SCr, LDH, Height, TBIL, CK, Apo-B, and CK-MB are the top 9 most important features among the 28 features identified by WRHFS. Their cumulative contribution was 0.51. Table 10 presents the performance of models developed by the different classifiers described in the “Performance Measures” section using the same 9 features identified by WRHFS. Among all the models, SVM with WRHFS achieved the best performance on all 4 measures.
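The per-method choice of feature count described above (keep the ranked prefix at which cross-validated accuracy peaks) can be sketched as follows; the accuracy curve below is hypothetical, shaped to mirror the reported behavior of information gain rather than taken from the study:

```python
def best_prefix(accuracies):
    """Given model accuracies using the top 1..k ranked features,
    return the feature count and accuracy at the peak."""
    best_n = max(range(len(accuracies)), key=accuracies.__getitem__) + 1
    return best_n, accuracies[best_n - 1]

# Hypothetical accuracy curve that rises to the 10th feature and then
# declines once the 11th feature is added
curve = [0.610, 0.640, 0.660, 0.680, 0.690, 0.700,
         0.710, 0.720, 0.723, 0.725, 0.710]
n_features, peak = best_prefix(curve)
```

On this toy curve the rule keeps 10 features, just as the study kept 10 features for information gain, 13 for Relief, 20 for standard deviation, and 9 for WRHFS.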
Table 9

Classification performances of support vector machine with different feature selection methods.

Method | Features | Sensitivity (N=398), n (%) | Specificity (N=394), n (%) | Accuracy (N=792), n (%) | Youden index
WRHFSa | 9 | 329 (82.7) | 317 (80.4) | 645 (81.5) | 0.63
Information gain | 10 | 297 (74.6) | 284 (72.1) | 574 (72.5) | 0.47
Relief | 13 | 277 (69.6) | 290 (73.7) | 577 (72.9) | 0.43
Standard deviation | 20 | 283 (71.1) | 291 (73.9) | 580 (73.2) | 0.45

aWRHFS: weighting- and ranking-based hybrid feature selection.

Table 10

Classification performances of different models with weighting- and ranking-based hybrid feature selection.

Classifier | Sensitivity (N=398), n (%) | Specificity (N=394), n (%) | Accuracy (N=792), n (%) | Youden index
SVMa | 329 (82.7) | 317 (80.4) | 645 (81.5) | 0.63
Bayes | 319 (80.2) | 197 (50.0) | 520 (65.7) | 0.30
CBAb | 305 (76.6) | 300 (76.1) | 605 (76.4) | 0.53
BPNNc | 280 (70.4) | 220 (55.8) | 501 (63.2) | 0.26
CARTd | 280 (70.4) | 283 (71.8) | 562 (71.0) | 0.42
C4.5 | 269 (67.6) | 302 (76.6) | 571 (72.1) | 0.44
ELMe | 220 (55.3) | 249 (63.2) | 469 (59.2) | 0.19

aSVM: support vector machine.

bCBA: classification based on associations.

cBPNN: back-propagation neural networks.

dCART: classification and regression tree.

eELM: extreme learning machine.

We visualized the change trend of the risk levels of ischemic stroke in Figure 2 using the surface fitting technique based on the 9 key features. The synthetic value (SV) is the linear combination of the feature values and their weights. The risk of ischemic stroke is reflected in the SV.
Figure 2

A surface chart for risk detection.

In this combination, age, α-HBD, and the other key features are the feature values, and 0.42, 0.38, and the other coefficients are the weights associated with the individual features (cf. the Weight (0-1) column of Table 8). Figure 2 presents the surface chart for stroke risk detection, in which the Y axis represents age between 45 and 90 years, the Z axis represents the risk index of suffering from ischemic stroke, and the X axis represents the SV. Figure 3 presents the risk index map for ischemic stroke detection, which is a top view of Figure 2. There are 3 ranks of risk index: “≤1.5” means no risk; “>1.5 but ≤2” means low risk; and “>2” means high risk. Different colors indicate different levels of risk.
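A sketch of the SV as described, using the normalized weights of the top 9 features from Table 8. Treat this as illustrative only: the paper's exact published formula (including any scaling of the raw feature values) is not reproduced in this record, so the raw-value linear combination below is an assumption:

```python
# Normalized weights of the top 9 features ("Weight (0-1)" column of Table 8)
weights = {"Age": 1.00, "α-HBD": 0.42, "SCr": 0.38, "LDH": 0.30,
           "Height": 0.30, "TBIL": 0.27, "CK": 0.22, "Apo-B": 0.20,
           "CK-MB": 0.19}

def synthetic_value(patient):
    """Linear combination of key feature values and their weights
    (illustrative form of the SV; values assumed already scaled)."""
    return sum(weights[k] * patient[k] for k in weights)
```

For a patient whose 9 scaled feature values are all 1.0, the SV is simply the sum of the weights (3.28), which shows how age, with weight 1.00, dominates the combination.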
Figure 3

A risk index map for ischemic stroke detection.

A surface chart for risk detection. A risk index map for ischemic stroke detection.
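The SV computation and the three risk categories can be sketched as follows. This is a minimal illustration: only the weights for age (0.42) and α-HBD (0.38) are stated in the text, so the remaining 7 selected features and their weights (which the paper lists in Table 8) are omitted here.

```python
# Sketch of the synthetic value (SV) and the three risk ranks described
# above. WEIGHTS holds only the two weights stated in the text; the other
# 7 selected features would be added from the paper's Table 8.
WEIGHTS = {"age": 0.42, "alpha_hbd": 0.38}

def synthetic_value(feature_values: dict) -> float:
    """SV = sum of (feature value x feature weight) over the selected features."""
    return sum(WEIGHTS[name] * v for name, v in feature_values.items())

def risk_rank(risk_index: float) -> str:
    """Map a risk index (the Z axis of Figure 2) to the three categories."""
    if risk_index <= 1.5:
        return "no risk"
    if risk_index <= 2:
        return "low risk"
    return "high risk"
```

In the full model the risk index is read off the fitted surface from the SV and age; this sketch covers only the linear-combination and thresholding steps.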

Discussion

To address the limitations of existing risk detection models and the high cost of stroke screening in hospitals, we proposed a new feature selection method, WRHFS, for risk detection of ischemic stroke. In this study, WRHFS selected features under the guidance of the top 3 filter methods applied to the 28 risk factors and produced an aggregated importance weight for each feature. As shown in Table 8, the top 9 features, which achieved a sensitivity of 82.7% (329/398), were selected for detecting the risk of ischemic stroke. WRHFS can also evaluate the effectiveness of existing filter feature selection methods based on their effectiveness coefficients and choose the top 3 filter methods. On the basis of the sorted importance weights of individual features, we chose 9 features and produced the change trend of risk levels and the risk index map for ischemic stroke.
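The aggregation step described above can be sketched as follows. The paper's exact aggregation rule is not reproduced here; this sketch assumes each filter method's per-feature scores are min-max normalized and then averaged into one aggregated importance weight per feature.

```python
import numpy as np

def aggregate_weights(filter_scores):
    """Combine per-feature scores from several filter methods into one
    aggregated importance weight per feature (an illustrative assumption;
    the paper's exact aggregation rule may differ).

    filter_scores: list of 1-D sequences, one per filter method, each
    giving a score for every candidate feature.
    """
    normalized = []
    for scores in filter_scores:
        s = np.asarray(scores, dtype=float)
        rng = s.max() - s.min()
        # Min-max normalize so methods on different scales are comparable.
        normalized.append((s - s.min()) / rng if rng else np.zeros_like(s))
    return np.mean(normalized, axis=0)

def rank_features(weights):
    """Feature indices sorted by decreasing aggregated weight."""
    return np.argsort(weights)[::-1]
```

With the top 3 filter methods as input, the ranked output gives the ordering from which the top 9 features are taken.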

Principal Findings

We compared the performance of the proposed feature selection method, WRHFS, against those of standard deviation, Relief, and information gain. The results revealed that age is the most influential factor because it has the largest weight, which is consistent with the literature on stroke [6,13]. By ranking the features in decreasing order of their importance to the performance of a risk detection model, WRHFS enables us to choose the most effective features, that is, those with the highest contributions to a model's performance measures. The evaluation results demonstrate that WRHFS can achieve a contribution rate of 0.51 with only the first 9 features, whereas the other 3 traditional feature selection methods require more features to reach the same level. As a feature selection method, WRHFS is also superior in that it can calculate effectiveness coefficients of individual features.

The contributions of the other 27 risk factors, excluding age, vary across the models constructed by the 4 feature selection methods (WRHFS, standard deviation, Relief, and information gain). More specifically, the contribution of α-HBD assessed by Relief is significantly greater than that assessed by the other feature selection methods, and the contribution of CK was ranked highest by standard deviation but almost 0 by Relief. This implies that a single objective function may not measure the importance of risk factors comprehensively.

Age is the most important feature found in this study, contributing approximately 13% to the model's performance. Because the risk of stroke is reflected in the SV, age should be integrated into the SV. In general, the risk of ischemic stroke increases with age. As shown in Figure 2, points A and B have the same age and are much older than point C. However, B has a higher risk than A because of its higher SV. In contrast, C is younger than A but also has a higher risk than A because of a higher SV. Therefore, the risk of ischemic stroke is strongly influenced by the SV.
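Selecting the smallest feature prefix that reaches a target cumulative contribution (the 0.51 reported for the top 9 features) can be sketched as follows. The threshold value comes from the text; the prefix-selection rule itself is an illustrative assumption about how contributions accumulate.

```python
import numpy as np

def top_k_by_cumulative_contribution(weights, threshold=0.51):
    """Select the smallest prefix of features, sorted by decreasing weight,
    whose normalized weights sum to at least `threshold`.

    Returns the indices of the selected features in rank order.
    """
    w = np.asarray(weights, dtype=float)
    order = np.argsort(w)[::-1]          # features by decreasing weight
    contrib = w[order] / w.sum()         # normalized contribution of each
    cum = np.cumsum(contrib)             # cumulative contribution curve
    k = int(np.searchsorted(cum, threshold) + 1)
    return order[:k].tolist()
```

Applied to the 28 aggregated weights, this kind of cutoff yields a parsimonious subset like the 9 features used in the final model.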
A person has a low risk of ischemic stroke if the SV is far from the high-risk interval (HRI; ie, 1675 to 2175) shown in Figure 3. The findings of this study not only provide methodological guidance on how to select more effective features for automated detection of stroke risk but may also help physicians improve their diagnoses in medical practice.

The major contribution of this research is WRHFS, a new generic feature selection method. WRHFS deploys continuous weighting and ranking of individual features by following the principle of a wrapper approach that integrates the strengths of various filter methods for feature selection. The evaluation shows that WRHFS yields a superior risk detection model that achieves better performance with fewer features than existing feature selection methods, demonstrating its effectiveness.

The findings of this study also have multiple practical implications for physicians. First, the top 9 features are easy to obtain. Physicians can calculate the corresponding SV and detect ischemic stroke risk indexes using the risk index map as an auxiliary diagnostic tool. As shown in Figure 3, the range of the SV between 1675 and 2175 (where the black arrow points) can be called the HRI; there appears to be a parabolic envelope curve around it. In addition, elderly people aged between 70 and 90 years tend to have a high risk of ischemic stroke, whereas the risk becomes lower when the SV is smaller (800 to 1500) or larger (3000 to 3250). Finally, an automated stroke risk detection platform could easily be developed on the basis of these findings and used during routine physical examinations.
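The interval-based screening reading of Figure 3 can be sketched as a simple lookup. The band boundaries (1675 to 2175 for the HRI, 800 to 1500 and 3000 to 3250 for the lower-risk ranges) are from the text; collapsing them into three labels is an illustrative simplification, since the full map also depends on age.

```python
def sv_screening_note(sv: float) -> str:
    """Rough screening label for an SV based on the bands read off Figure 3
    (illustrative only; the actual risk index also depends on age)."""
    if 1675 <= sv <= 2175:
        return "high-risk interval"
    if 800 <= sv <= 1500 or 3000 <= sv <= 3250:
        return "lower-risk range"
    return "intermediate"
```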

Limitations

This study has a couple of limitations that offer future research opportunities. First, the acquisition of medical samples is very difficult, and we were unable to find data samples that included all of the risk factors reported in the literature. It would be worthwhile to conduct a future study with a larger and different dataset with more features to examine whether the findings of this research still hold. Second, we used a straightforward way to aggregate the rankings of individual filter methods, which may or may not be optimal. We plan to explore other aggregation schemes in future research.

Conclusions

Automatic detection of stroke risk has been increasingly studied in recent years, and existing machine learning research in this area faces a significant challenge in selecting effective features as predictive cues. How important risk factors are selected is critical to a detection model's performance. This study proposed, developed, and evaluated a new feature selection method that helps identify the most important features for building effective and parsimonious models for stroke risk detection. The proposed method, WRHFS, provides a novel methodological research contribution and practical implications.
References (39 in total)

1.  Variability of blood pressure in dialysis patients: a new marker of cardiovascular risk.

Authors:  Biagio Di Iorio; Lucia Di Micco; Serena Torraca; Maria Luisa Sirico; Pasquale Guastaferro; Luigi Chiuchiolo; Filippo Nigro; Antonietta De Blasio; Paolo Romano; Andrea Pota; Roberto Rubino; Luigi Morrone; Teodoro Lopez; Francesco Gaetano Casino
Journal:  J Nephrol       Date:  2013 Jan-Feb       Impact factor: 3.902

2.  Discriminative semi-supervised feature selection via manifold regularization.

Authors:  Zenglin Xu; Irwin King; Michael Rung-Tsong Lyu; Rong Jin
Journal:  IEEE Trans Neural Netw       Date:  2010-06-21

3.  2 × 2 Tables: a note on Campbell's recommendation.

Authors:  F M T A Busing; B Weaver; S Dubois
Journal:  Stat Med       Date:  2015-11-17       Impact factor: 2.373

4.  Bacterial meningitis complicating the course of liver cirrhosis.

Authors:  Pasquale Pagliano; Giovanni Boccia; Francesco De Caro; Silvano Esposito
Journal:  Infection       Date:  2017-06-14       Impact factor: 3.553

5.  Guidelines for cardiovascular risk assessment and cholesterol treatment.

Authors:  Donald M Lloyd-Jones; David C Goff; Neil J Stone
Journal:  JAMA       Date:  2014-06-04       Impact factor: 56.272

6.  High normal blood pressure is an independent risk factor for cardiovascular disease among middle-aged but not in elderly populations: 9-year results of a population-based study.

Authors:  F Hadaegh; R Mohebi; D Khalili; M Hasheminia; F Sheikholeslami; F Azizi
Journal:  J Hum Hypertens       Date:  2012-01-05       Impact factor: 3.012

7.  Effect of elevated total cholesterol level and hypertension on the risk of fatal cardiovascular disease: a cohort study of Chinese steelworkers.

Authors:  Ying Yang; Jian-Xin Li; Ji-Chun Chen; Jie Cao; Xiang-Feng Lu; Shu-Feng Chen; Xi-Gui Wu; Xiu-Fang Duan; Xing-Bo Mo; Dong-Feng Gu
Journal:  Chin Med J (Engl)       Date:  2011-11       Impact factor: 2.628

8.  Influence of socioeconomic status on acute myocardial infarction in the Chinese population: the INTERHEART China study.

Authors:  Jin Guo; Wei Li; Yang Wang; Tao Chen; Koon Teo; Li-Sheng Liu; Salim Yusuf
Journal:  Chin Med J (Engl)       Date:  2012-12       Impact factor: 2.628

9.  Predicting the 30-year risk of cardiovascular disease: the framingham heart study.

Authors:  Michael J Pencina; Ralph B D'Agostino; Martin G Larson; Joseph M Massaro; Ramachandran S Vasan
Journal:  Circulation       Date:  2009-06-08       Impact factor: 29.690

10.  Application of systematic coronary risk evaluation chart to identify chronic myeloid leukemia patients at risk of cardiovascular diseases during nilotinib treatment.

Authors:  Massimo Breccia; Matteo Molica; Irene Zacheo; Alessandra Serrao; Giuliana Alimena
Journal:  Ann Hematol       Date:  2014-10-12       Impact factor: 3.673

