
Predicting Chronic Kidney Disease Using Hybrid Machine Learning Based on Apache Spark.

Manal A Abdel-Fattah, Nermin Abdelhakim Othman, Nagwa Goher.

Abstract

Chronic kidney disease (CKD) has become a widespread disease. It is associated with serious risks such as cardiovascular disease and end-stage renal disease, which can often be avoided through early detection and treatment of people at risk. Machine learning algorithms can help medical scientists diagnose the disease accurately at an early stage. Recently, big data platforms have been integrated with machine learning algorithms to add value to healthcare. This paper therefore proposes hybrid machine learning techniques for detecting CKD that combine feature selection methods with machine learning classification algorithms on a big data platform (Apache Spark). Two feature selection techniques, Relief-F and chi-squared, were applied to select the important features. Six machine learning classification algorithms were used: decision tree (DT), logistic regression (LR), Naive Bayes (NB), Random Forest (RF), support vector machine (SVM), and, as an ensemble learning algorithm, Gradient-Boosted Trees (GBT Classifier). Four evaluation measures, namely accuracy, precision, recall, and F1-measure, were applied to validate the results. For each algorithm, cross-validation and testing results were computed on the full features, the features selected by Relief-F, and the features selected by chi-squared. The results showed that the SVM, DT, and GBT classifiers with the selected features achieved the best performance at 100% accuracy. Overall, the features selected by Relief-F performed better than the full features and the features selected by chi-squared.
Copyright © 2022 Manal A Abdel-Fattah et al.

Year:  2022        PMID: 35251161      PMCID: PMC8890824          DOI: 10.1155/2022/9898831

Source DB:  PubMed          Journal:  Comput Intell Neurosci


1. Introduction

The present era, especially the last two decades, can be called the era of big data, in which digital data is becoming increasingly crucial in fields such as science, healthcare, technology, and society. Huge volumes of data are generated by sensor networks and mobile applications in almost all fields, healthcare in particular, and this multitude of data is what we call big data [1]. The wide variety of data sources, such as streaming machines and high-end output instruments, together with visualization and knowledge extraction across these vast and diverse types of data, poses a significant challenge when sufficiently advanced technologies and tools are not used. One of the most prominent technological challenges in big data analytics lies in finding effective ways to obtain useful and relevant information for different categories of users. Healthcare data of many forms and types is now gathered in both clinical and nonclinical environments, and the most crucial data in healthcare analytics is the digital copy of a patient's medical history. Designing a distributed data system to handle big data faces three main challenges. First, it is difficult to collect data from distributed locations because the data is diverse and voluminous. Second, storage is the chief issue for heterogeneous and enormous datasets, because a big data system must store the data while still guaranteeing performance. Third, big data analytics, specifically mining enormous datasets in real time, is demanding; this includes visualization, prediction, and optimization [2].
These challenges call for an up-to-date processing paradigm, because present data management systems do not handle the heterogeneous nature of the data or its real-time aspect efficiently. Traditional database management systems cannot support the continuous growth in data size. To address the storage and processing of enormous and heterogeneous data, the research community has proposed a number of platforms, such as Apache Spark, Apache Hadoop [3], Apache Kafka [4], and Apache Storm [5], which have been used to solve healthcare problems [6-8]. Chronic kidney disease (CKD) has received a lot of interest due to its high death rate. According to the World Health Organization (WHO), chronic diseases have become a major hazard to emerging countries [9]. CKD is a kidney illness that can be treated in its early stages but eventually leads to renal failure if left untreated. In 2016, chronic kidney disease affected 753 million individuals globally, 336 million males and 417 million females [10]. Diagnosing CKD early is the best way to treat it and to prevent it from progressing to kidney failure, which necessitates dialysis or kidney transplantation. Global strategies for the early detection and treatment of people with CKD are therefore required. To mine hidden patterns from clinical data for effective decision-making and to help doctors make more accurate diagnoses, a computer-aided diagnosis system based on artificial intelligence is needed. Artificial intelligence techniques (machine learning and deep learning) have been used in the health field, notably in disease prediction and diagnosis.
Chronic kidney disease (CKD) is a condition that impairs the kidneys' ability to function. CKD is generally divided into stages, with renal failure occurring when the kidneys can no longer carry out their roles of blood purification and mineral balance in the body [11]. According to current estimates, CKD is more common in adults over 65 years old (38%) than in people aged 45–64 years (12%) or 18–44 years (6%), and women have a somewhat higher rate of CKD (14%) than men [12]. Machine learning is a field that focuses on studying large amounts of data with multiple variables. It developed from the study of pattern recognition and computational learning theory in artificial intelligence and comprises computational methods, algorithms, and analysis techniques. In the medical sciences, machine learning aims to help health specialists and doctors make accurate diagnoses, choose the best-fit medicines for patients, identify patients at high risk, and, most importantly, improve patients' physical condition at minimal cost. Machine learning (ML) has demonstrated remarkable performance across a range of applications, such as speech recognition [13], computer vision [14], medical diagnostics [15], and engineering [16]. Feature selection (FS) is a crucial preprocessing step in the ML process that determines the most relevant attributes within a dataset. Removing unimportant and unnecessary attributes can result in simpler and more accurate models. In this paper, two feature selection methods based on Apache Spark are used, namely Relief-F [17] and chi-squared [18]. Several research works have used ML techniques to predict CKD. For example, Charleonnan et al. [19] used four ML algorithms, K-nearest neighbors (KNN), support vector machine (SVM), logistic regression (LR), and decision tree (DT), to predict CKD.
Other research works used hybrid approaches that integrate feature selection methods with ML algorithms to predict CKD. Feature selection methods are used to reduce the number of features and select the optimal subset of features from the dataset. For example, the authors of [20] used chi-square, correlation-based feature selection (CFS), and Lasso feature selection to select the essential features from the dataset, and applied artificial neural network (ANN), C5.0, LR, SVM, KNN, and RF to both the full features and the selected features. Recently, researchers have been using big data platforms such as Apache Spark [21], a unified analytics engine for large-scale data processing. Spark is reported to run workloads up to 100 times faster than Hadoop on large-scale clusters. It provides high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs. It also includes a number of higher-level tools: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming. MLlib, Spark's machine learning library [21], aims to make machine learning scalable and simple. At a high level, it provides ML algorithms such as classification, regression, clustering, and collaborative filtering, and featurization tools such as feature extraction, transformation, dimensionality reduction, and selection. Previous studies of CKD prediction have not used big data platforms. The goal of this work is to predict CKD using hybrid ML techniques based on Apache Spark.
Our contributions can be summarized as follows:
(1) Developing hybrid ML techniques based on Apache Spark to predict CKD
(2) Applying feature selection algorithms to select the important features from the dataset
(3) Applying optimization techniques, including grid search with cross-validation, to tune the ML algorithms and enhance performance
(4) Applying different ML classification algorithms to both the full features and the selected features
(5) Applying ensemble learning, such as Gradient-Boosted Trees based on Apache Spark, to predict CKD
The rest of this paper is structured as follows: Section 2 reviews previous studies on CKD prediction. Section 3 presents the main stages of the system developed to predict CKD based on Apache Spark. Section 4 presents the experimental results. Finally, conclusions are presented in Section 5.

2. Related Works

Many authors have used different ML techniques for the diagnosis and prediction of chronic kidney disease as shown in Table 1.
Table 1

Related works for prediction of CKD.

| REF | Year | Models | Feature selection methods | Dataset |
|---|---|---|---|---|
| [22] | 2021 | SVM, KNN, DT, and RF | Recursive feature elimination (RFE) | CKD dataset |
| [20] | 2020 | ANN, C5.0, LR, LSVM, KNN, and RF | CFS, Lasso, and wrapper method | CKD dataset |
| [23] | 2020 | RF, SVM, NB, and LR | RF-FS, FS, FES, BS, and BES | CKD dataset |
| [24] | 2020 | An ensemble of decision tree models | Cost-sensitive ensemble feature ranking | CKD dataset |
| [25] | 2020 | Bagging and random subspace methods based on KNN, NB, and DT | No | CKD dataset |
| [26] | 2020 | Decision Table, J48, MLP, and NB | Genetic search algorithm | CKD dataset |
| [27] | 2019 | LR, RF, SVM, KNN, NB, and FNN; a hybrid model of LR and RF | No | CKD dataset |
| [28] | 2019 | Artificial neural network (ANN) and SVM | Correlation coefficients | CKD dataset |
| [29] | 2018 | NB, K-Star, SVM, and J48 | No | CKD dataset |
| [30] | 2018 | AdaBoost, KNN, NB, and SVM | CFS | CKD dataset |

For example, in [27], the authors proposed a hybrid model that combines LR and RF to predict CKD. They compared their proposed model with six ML algorithms: LR, RF, SVM, KNN, Naive Bayes (NB), and feedforward neural network (FNN). Their proposed model registered the highest accuracy at 99.83%. In [29], NB, K-Star, SVM, and J48 classifiers were used to predict CKD, and their performance was compared using the WEKA software. The J48 algorithm performed better than the other algorithms, with 99% accuracy. Some authors combined ML algorithms with feature selection methods to predict CKD. In [22], recursive feature elimination (RFE) was used to select the essential features from the CKD dataset. Four classification algorithms (SVM, KNN, DT, and RF) were applied to both the full features and the selected features, and RF outperformed all other algorithms. In [20], the authors used chi-square, CFS, and Lasso feature selection to select the essential features from the dataset. They applied ANN, C5.0, LR, LSVM, KNN, and RF to both the full features and the selected features, and LSVM with full features registered the highest accuracy at 98.86%. In [23], five feature selection methods, Random Forest feature selection (RF-FS), forward selection (FS), forward exhaustive selection (FES), backward selection (BS), and backward exhaustive selection (BES), were used to select the most important features from the dataset. Four ML algorithms, RF, SVM, NB, and LR, were used to predict CKD, and RF with Random Forest feature selection achieved the best performance with 98.8% accuracy. In [26], a genetic search algorithm was used to select the most important features from the CKD dataset. Decision Table, J48, Multilayer Perceptron (MLP), and NB were applied to both the full features and the selected features.
Using the genetic search algorithm enhanced the performance, and the MLP classifier outperformed the other classifiers. In [30], the important features were selected using correlation-based feature selection (CFS). AdaBoost, KNN, NB, and SVM were used to detect CKD, and the proposed combination of CFS with AdaBoost achieved the best performance at 98.1% accuracy. In [25], the authors used two ensemble techniques, Bagging and Random Subspace, with three base learners, KNN, NB, and DT, to predict CKD. Random Subspace achieved better performance than Bagging with the KNN classifier. Previous studies applied ML techniques to analyze CKD data without using big data platforms. This motivates us to use a big data platform (Spark) to analyze CKD data with hybrid approaches (feature selection methods combined with ML classification algorithms and with ensemble algorithms).

3. Methodology

The proposed system for predicting chronic kidney disease combines two main approaches, as shown in Figure 1. The first approach uses feature selection methods to select the essential features from the chronic kidney disease dataset. The second approach applies ML techniques (DT, LR, RF, SVM, NB, and ensemble learning) to the selected features and to the full features to predict CKD. The proposed system is composed of six steps. In the first step (data collection), the CKD dataset from the UCI machine learning repository is used. In the second step (data preprocessing), missing values are handled. In the third step, feature selection methods are used to select the essential features. In the fourth step, grid search with stratified cross-validation is used to optimize the parameters of the ML and ensemble learning techniques. The following subsections describe each step in detail, including the splitting of the dataset and the evaluation of the models.
Figure 1

The steps of predicting CKD based on Apache Spark.

3.1. Data Collection

The chronic kidney disease (CKD) dataset used in this study was obtained from the UCI machine learning repository [31]. The dataset includes 400 samples, each with 24 features and 1 class label. The class label has two values: ckd (sample with CKD) and notckd (sample without CKD). The details of each feature are described in Table 2.
Table 2

The CKD dataset description.

| Feature | Description |
|---|---|
| age | Age |
| bp | Blood pressure |
| sg | Specific gravity |
| al | Albumin |
| su | Sugar |
| rbc | Red blood cells |
| pc | Pus cell |
| pcc | Pus cell clumps |
| ba | Bacteria |
| bgr | Blood glucose random |
| bu | Blood urea |
| sc | Serum creatinine |
| sod | Sodium |
| pot | Potassium |
| hemo | Hemoglobin |
| pcv | Packed cell volume |
| wc | White blood cell count |
| rc | Red blood cell count |
| htn | Hypertension |
| dm | Diabetes mellitus |
| cad | Coronary artery disease |
| appet | Appetite |
| pe | Pedal edema |
| ane | Anemia |
| class | Class |

3.2. Data Preprocessing

The dataset included outliers and noise, so it needed to be cleaned in a preprocessing stage. This stage covered the estimation of missing values, the elimination of noise such as outliers, normalization, and a check for class imbalance. Missing values arise because certain measurements may not be recorded when patients are tested. Only 158 cases in the dataset were complete; the remaining instances had missing values. Ignoring incomplete records is the simplest way of dealing with missing values, but this strategy is ineffective for small datasets. As an alternative, the missing data can be estimated rather than the records removed. Missing values of nominal features were filled with the mode, and missing values of numerical features were filled with the mean.
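The mean and mode imputation described above can be illustrated with a minimal sketch in plain Python. The function name `impute`, the record layout, and the toy values are assumptions made for illustration only; the study itself was implemented in PySpark, and this is not its code.

```python
from collections import Counter
from statistics import mean


def impute(rows, numeric_cols, nominal_cols):
    """Return copies of rows with None replaced by the column mean
    (numeric columns) or the column mode (nominal columns)."""
    fill = {}
    for col in numeric_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = mean(observed)
    for col in nominal_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


# Toy records with CKD-style column names; the values are made up.
records = [
    {"age": 48, "bp": 80, "htn": "yes"},
    {"age": None, "bp": 70, "htn": "no"},
    {"age": 62, "bp": None, "htn": "no"},
    {"age": 52, "bp": 90, "htn": None},
]
cleaned = impute(records, numeric_cols=["age", "bp"], nominal_cols=["htn"])
```

In this toy example the missing age becomes 54 (the mean of 48, 62, and 52), the missing blood pressure becomes 80, and the missing hypertension flag becomes "no" (the most frequent value). The original records are left unchanged.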

3.3. Feature Selection Methods

The main benefit of feature selection algorithms is that they identify the important features in the dataset. Combining a classifier with feature selection can produce better results and reduce the model's execution time. This study applied two feature selection methods based on Apache Spark, Relief-F and chi-squared, to select the subset of important features from the dataset. Relief-F [32] is a frequently used feature weighting technique that assigns a weight to each feature in a dataset to reflect the quality of the feature [33]. The chi-squared method uses a statistical hypothesis test to rank each feature [18].
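As an illustration of chi-squared ranking, the sketch below computes the Pearson chi-squared statistic of the contingency table between a nominal feature and the class label, and ranks features by that score. It assumes nominal features and uses made-up toy data; the helper `chi_squared_score` is hypothetical, and the exact computation behind the scores reported in this study may differ.

```python
from collections import Counter


def chi_squared_score(feature_values, labels):
    """Pearson chi-squared statistic of the feature-by-class contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    class_totals = Counter(labels)
    score = 0.0
    for f, f_total in feature_totals.items():
        for c, c_total in class_totals.items():
            # Expected count if the feature and the class were independent.
            expected = f_total * c_total / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


# Toy data: htn follows the class exactly, appet is unrelated to it.
htn = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
appet = ["good", "poor", "good", "poor", "good", "poor", "good", "poor"]
label = ["ckd", "ckd", "ckd", "notckd", "notckd", "notckd", "notckd", "ckd"]

scores = {
    "htn": chi_squared_score(htn, label),
    "appet": chi_squared_score(appet, label),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the feature that follows the class label exactly scores 8.0 and the unrelated feature scores 0.0, so `htn` is ranked first. A higher score indicates stronger dependence between the feature and the class.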

3.4. Splitting the Dataset

The CKD dataset is split into an 80% training set and a 20% testing set. Stratified cross-validation is used on the training set to train and optimize the models, and the cross-validation results are recorded. The models are then evaluated on the testing set, and the testing results are recorded.
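A minimal sketch of an 80/20 split is given below. It assumes that the hold-out split is stratified by class, which the text states only for cross-validation, and it assumes class counts of 250 ckd and 150 notckd, which are not given in this paper. The function `stratified_split` is a hypothetical helper, not a Spark API.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class contributes
    the same fraction of its samples to the test set."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Assumed class balance for 400 samples (not stated in the text).
labels = ["ckd"] * 250 + ["notckd"] * 150
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
```

With these assumptions the split yields 320 training and 80 testing samples, and the testing set keeps the same class ratio as the full dataset (50 ckd and 30 notckd).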

3.5. Models' Optimization and Training

3.5.1. Optimization Methods

Grid search with stratified K-fold cross-validation is used to optimize the models and tune their hyperparameters. Grid search is the most common method for hyperparameter optimization. The user first defines a set of candidate values for each hyperparameter. The search then evaluates the model on every combination of these values and chooses the combination that gives the best performance. In K-fold cross-validation, the dataset is divided into k folds of equal size. The classifier is trained on k-1 folds, and the remaining fold is used to test it. This procedure is repeated until each of the k folds (ten in this study) has served as the testing set. The performance of the classifier is measured for each fold, and the final evaluation is the average performance over all folds.
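The procedure can be sketched in plain Python as follows. To keep the example self-contained, a one-dimensional k-nearest-neighbours classifier stands in for the models tuned in this study, and the data, the helper names, and the grid over `n_neighbors` are all assumptions for illustration; the sketch uses 5 folds rather than ten because the toy dataset is small.

```python
from collections import Counter, defaultdict
from itertools import product


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for idx in by_class.values():
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors nearest 1-D training points."""
    order = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return Counter(train_y[j] for j in order[:n_neighbors]).most_common(1)[0][0]


def cross_val_accuracy(xs, ys, k, **params):
    """Mean accuracy over k stratified folds, each held out once."""
    scores = []
    for held_out in stratified_folds(ys, k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        tr_x = [xs[i] for i in train]
        tr_y = [ys[i] for i in train]
        correct = sum(
            knn_predict(tr_x, tr_y, xs[i], **params) == ys[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / k


def grid_search(xs, ys, grid, k=5):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    best = None
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cross_val_accuracy(xs, ys, k, **params)
        if best is None or score > best[1]:
            best = (params, score)
    return best


# Toy data: two separated clusters plus one "ckd" point inside the other cluster.
xs = [float(v) for v in range(1, 11)] + [float(v) for v in range(21, 31)] + [5.5]
ys = ["notckd"] * 10 + ["ckd"] * 10 + ["ckd"]
best_params, best_score = grid_search(xs, ys, {"n_neighbors": [1, 3, 5]}, k=5)
```

On this toy data, a single neighbour is misled by the stray point, so the search prefers `n_neighbors = 3` with a mean cross-validated accuracy of 0.96. Ties are resolved in favour of the first combination evaluated.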

3.5.2. Machine Learning Models

The classification models used in this research are as follows:
Decision tree (DT): a supervised learning algorithm, generally used for classification problems with a predefined target variable. It works with both categorical and continuous input and output variables and can be applied to both classification and regression problems. It divides the population or sample into two or more homogeneous sets (subpopulations) based on the most significant splitter among the input variables [34].
Random forest (RF): a supervised ML technique that builds many trees and combines them for more accurate prediction [23].
Logistic regression (LR): a method for binary classification problems. It uses a logistic (sigmoid) function to predict the probabilities of the labels for an unlabeled observation [35].
Support vector machine (SVM): a supervised ML technique that separates the dataset into classes using a hyperplane [22].
Naïve Bayes (NB): a probabilistic classifier trained using Bayes' theorem. It calculates a probability distribution over a set of classes for a given observation [29].
Gradient-Boosted Trees (GBTs): an algorithm that trains an ensemble of decision trees sequentially. Each new tree is optimized using information from the previously trained trees, so the model improves with every new tree. Because GBT trains one tree at a time, training can take longer, and an ensemble with many trees is prone to overfitting. Each tree in a GBT ensemble can, however, be shallow, which makes it easier to train. On each iteration, the method predicts the label of each training sample using the current ensemble and then compares the prediction to the true label [36].
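The sequential training of Gradient-Boosted Trees can be illustrated with a deliberately simplified sketch. It assumes a single numeric feature, depth-one trees (stumps), and a squared-error loss, so each new stump is fitted to the residuals of the current ensemble. Library implementations for classification typically optimize a logistic loss instead, so this is not the algorithm as run in this study, and the function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one threshold) on a 1-D feature."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Train stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # Compare the current ensemble's prediction with the true label.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


# Toy data: label 1 for large feature values, 0 otherwise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys, n_trees=20, learning_rate=0.5)
predicted = [1 if model(x) >= 0.5 else 0 for x in xs]
```

Each added stump halves the remaining residual in this example, which shows how the ensemble improves with every new tree while each individual tree stays shallow.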

3.6. Evaluating the Models

As shown in Equations 1-4, the models are evaluated using four standard metrics: accuracy, precision, recall, and F1-score, where TP stands for true positive, TN stands for true negative, FP stands for false positive, and FN stands for false negative.
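In terms of these counts, the four metrics have the following standard definitions. The numbering is assumed to follow the order in which the metrics are listed above.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3}\\
\text{F1-score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```

As a purely illustrative example with made-up counts, TP = 48, TN = 30, FP = 0, and FN = 2 give an accuracy of 78/80 = 97.5%, a precision of 100%, a recall of 96%, and an F1-score of about 98%.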

4. Experiments and Results

This section discusses the results of applying chi-square and Relief-F to the dataset to select the most important features. It also discusses the cross-validation and testing performance of the ML algorithms (SVM, LR, NB, RF, DT, and GBT Classifier) on the full features and on the selected features. In addition, it reports the best parameter values found by grid search for each ML algorithm. The CKD dataset was split into an 80% training set and a 20% testing set. The cross-validation results were recorded on the training set, and the testing results were recorded on the testing set. The ML algorithms and feature selection methods were implemented using PySpark.

4.1. Results of Chi-Square Feature Selection Method and ML Algorithms

In this subsection, the essential features were selected by the chi-square algorithm and passed to the ML models for predicting CKD. The 12 features with the highest chi-square scores, which were used to predict CKD, are wc, bgr, bu, sc, pcv, al, haem, age, su, htn, dm, and bp, as shown in Figure 2. Among these, wc has the highest score at 12733.72, while bp has the lowest at 80.02. The second highest score is registered by bgr at 2428.327. Sc and pcv have similar scores at 354.410 and 324.706, respectively, and htn and dm have approximately the same score at 86.29 and 80.44, respectively. Table 3 displays the chi-square scores of all features. Over all features, the highest score is registered by wc at 12733.72 and the lowest by sg at 0.0050.
Figure 2

The important features selected by chi-square.

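Table 3

The chi-square scores of all features.

| Feature | Score |
|---|---|
| wc | 12733.72 |
| bgr | 2428.327 |
| bu | 2336.00 |
| sc | 354.410 |
| pcv | 324.706 |
| al | 228.104 |
| haem | 125.065 |
| age | 113.460 |
| su | 100.94 |
| htn | 86.29 |
| dm | 80.44 |
| bp | 80.02 |
| pe | 45.10 |
| ane | 35.611 |
| sod | 28.793 |
| pcc | 24.075 |
| rc | 20.84 |
| cad | 19.93 |
| pc | 14.16 |
| ba | 12.58 |
| appe | 12.58 |
| rbc | 9.41 |
| pot | 4.07 |
| sg | 0.0050 |
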
Table 4 describes the cross-validation and testing performance of the ML models on the features selected by chi-square, where AC, PR, RE, and FS denote accuracy, precision, recall, and F1-score. In cross-validation, RF registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest (AC = 81%, PR = 85%, RE = 82%, FS = 82%). LR and SVM had the same performance (AC = 97%, PR = 97%, RE = 97%, FS = 97%). In testing, SVM registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest (AC = 82%, PR = 88%, RE = 82%, FS = 82%). The second highest testing performance was registered by LR (AC = 97%, PR = 98%, RE = 97%, FS = 97%).
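Table 4

The performance of ML with the features selected by chi-square (all values in %).

| Model | CV AC | CV PR | CV RE | CV FS | Test AC | Test PR | Test RE | Test FS |
|---|---|---|---|---|---|---|---|---|
| DT | 97 | 98 | 98 | 98 | 92 | 93 | 93 | 93 |
| RF | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 |
| LR | 97 | 97 | 97 | 97 | 97 | 98 | 97 | 97 |
| SVM | 97 | 97 | 97 | 97 | 100 | 100 | 100 | 100 |
| NB | 81 | 85 | 82 | 82 | 82 | 88 | 82 | 82 |
| GBT Classifier | 98 | 98 | 98 | 98 | 95 | 95 | 95 | 95 |
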
To optimize the ML models, several parameter values were tuned; the best settings are shown in Table 5.
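Table 5

The best values of the ML parameters applied to the features selected by chi-square.

| Model | Parameter | Value |
|---|---|---|
| DT | Impurity | Gini |
| DT | maxDepth | 3 |
| DT | maxBins | 10 |
| RF | Impurity | Gini |
| RF | maxDepth | 6 |
| RF | maxBins | 32 |
| LR | regParam | 0.8 |
| LR | maxIter | 20 |
| SVM | regParam | 0.01 |
| SVM | maxIter | 100 |
| NB | Smoothing | 0.2 |
| GBT Classifier | maxDepth | 2 |
| GBT Classifier | maxBins | 60 |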

4.2. Results of Relief-F Feature Selection Method and ML Algorithms

In this subsection, the essential features were selected by the Relief-F algorithm and passed to the ML models for predicting CKD. The 12 features with the highest Relief-F weights, which were used to predict CKD, are shown in Figure 3. Among these, rbc has the highest weight at 0.4551, while appe has the lowest at 0.062875. The second highest weight is registered by haem at 0.365745. Al and dm have approximately the same weights at 0.257775 and 0.24085, respectively.
Figure 3

The weights of the most essential features selected by Relief-F.
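To illustrate how Relief-style weights arise, including negative ones, the sketch below implements the basic Relief update for a binary problem. This is a simplification swapped in for clarity: it uses one nearest hit and one nearest miss per sample and assumes numeric features already scaled to [0, 1], whereas Relief-F averages over several neighbours and handles multiple classes and missing values. The function name and the toy data are assumptions for illustration.

```python
def relief_weights(X, y):
    """Basic Relief: for each sample, reward features that differ on the
    nearest sample of the other class (miss) and penalize features that
    differ on the nearest sample of the same class (hit)."""
    n, d = len(X), len(X[0])

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            diff_miss = abs(X[i][f] - X[miss][f])
            diff_hit = abs(X[i][f] - X[hit][f])
            weights[f] += (diff_miss - diff_hit) / n
    return weights


# Toy data: feature 0 separates the classes, feature 1 is unrelated noise.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
```

In this example the informative feature receives a weight of 0.8 and the noise feature a weight of -0.6. A negative weight means that the feature varies more between neighbours of the same class than between neighbours of different classes.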

Table 6 displays the Relief-F weights of all features. The highest weight is registered by rbc at 0.4551, while the lowest is registered by bp at -0.01584. Table 7 describes the cross-validation and testing performance of the ML models on the features selected by Relief-F. In cross-validation, DT, RF, and GBT Classifier registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest (AC = 88%, PR = 89%, RE = 89%, FS = 89%). LR and SVM had the same performance (AC = 99%, PR = 99%, RE = 99%, FS = 99%).
Table 6

The weights of all features computed by Relief-F.

| Feature | Weight |
|---|---|
| rbc | 0.4551 |
| haem | 0.365745 |
| pcv | 0.31156 |
| sg | 0.289825 |
| htn | 0.275375 |
| al | 0.257775 |
| dm | 0.24085 |
| rc | 0.160433 |
| pc | 0.136225 |
| sod | 0.104587 |
| age | 0.065923 |
| appe | 0.062875 |
| pe | 0.056825 |
| su | 0.03165 |
| bgr | 0.029549 |
| ane | 0.027 |
| bu | 0.022733 |
| sc | 0.015806 |
| pcc | 0.015675 |
| wc | 0.006426 |
| ba | -0.00012 |
| pot | -0.00411 |
| cad | -0.01197 |
| bp | -0.01584 |

Table 7

The performance of ML with the features selected by Relief-F (all values in %).

| Model | CV AC | CV PR | CV RE | CV FS | Test AC | Test PR | Test RE | Test FS |
|---|---|---|---|---|---|---|---|---|
| DT | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| RF | 100 | 100 | 100 | 100 | 98 | 99 | 99 | 99 |
| LR | 99 | 99 | 99 | 99 | 98 | 99 | 99 | 99 |
| SVM | 99 | 99 | 99 | 99 | 98 | 99 | 99 | 99 |
| NB | 88 | 89 | 89 | 89 | 95 | 95 | 95 | 95 |
| GBT Classifier | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

In testing, DT and GBT Classifier registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest (AC = 95%, PR = 95%, RE = 95%, FS = 95%). LR and SVM had the same performance (AC = 98%, PR = 99%, RE = 99%, FS = 99%). To optimize the ML models, several parameter values were tuned; the best settings are shown in Table 8.
Table 8

The best values of the ML parameters applied to the features selected by Relief-F.

| Model | Parameter | Value |
|---|---|---|
| DT | Impurity | Gini |
| DT | maxDepth | 4 |
| DT | maxBins | 32 |
| RF | Impurity | Gini |
| RF | maxDepth | 5 |
| RF | maxBins | 32 |
| LR | regParam | 0.1 |
| LR | maxIter | 20 |
| SVM | regParam | 0.01 |
| SVM | maxIter | 100 |
| NB | Smoothing | 0.1 |
| GBT Classifier | maxDepth | 4 |
| GBT Classifier | maxBins | 20 |

4.3. The Performance of ML with Full Features

Table 9 presents the cross-validation and testing results of applying the ML models to the full features. Overall, RF achieved the best performance in both cross-validation and testing. In cross-validation, RF registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB had the lowest (AC = 84%, PR = 88%, RE = 84%, FS = 84%). LR, SVM, and GBT Classifier had the same performance (AC = 99%, PR = 99%, RE = 99%, FS = 99%). In testing, RF and SVM registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB had the lowest (AC = 87%, PR = 91%, RE = 88%, FS = 88%). To optimize the ML models, several parameter values were tuned; the best settings are shown in Table 10.
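Table 9

The performance of ML with full features (all values in %).

| Model | CV AC | CV PR | CV RE | CV FS | Test AC | Test PR | Test RE | Test FS |
|---|---|---|---|---|---|---|---|---|
| DT | 98.43 | 98 | 98 | 98 | 95 | 95 | 95 | 95 |
| RF | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| LR | 99 | 99 | 99 | 99 | 98 | 99 | 99 | 99 |
| SVM | 99 | 99 | 99 | 99 | 100 | 100 | 100 | 100 |
| NB | 84 | 88 | 84 | 84 | 87 | 91 | 88 | 88 |
| GBT Classifier | 99 | 99 | 99 | 99 | 95 | 95 | 95 | 95 |
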
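Table 10

The best values of the ML parameters applied to the full features.

| Model | Parameter | Value |
|---|---|---|
| DT | Impurity | Gini |
| DT | maxDepth | 4 |
| DT | maxBins | 10 |
| RF | Impurity | Gini |
| RF | maxDepth | 7 |
| RF | maxBins | 32 |
| LR | regParam | 0.3 |
| LR | maxIter | 10 |
| SVM | regParam | 0.01 |
| SVM | maxIter | 1000 |
| NB | Smoothing | 0.2 |
| GBT Classifier | maxDepth | 2 |
| GBT Classifier | maxBins | 60 |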

4.4. Discussion

Table 11 presents the models that achieved the highest cross-validation results. With the features selected by Relief-F, three models achieved the best cross-validation performance: DT, RF, and GBT Classifier. In comparison, with the full features and with the features selected by chi-square, only one model achieved the best cross-validation performance: RF.
Table 11

Best models for cross-validation results (all values in %).

| Best model | Features | AC | PR | RE | FS |
|---|---|---|---|---|---|
| RF | Full features | 100 | 100 | 100 | 100 |
| RF | Features selected by chi-square | 100 | 100 | 100 | 100 |
| DT | Features selected by Relief-F | 100 | 100 | 100 | 100 |
| RF | Features selected by Relief-F | 100 | 100 | 100 | 100 |
| GBT Classifier | Features selected by Relief-F | 100 | 100 | 100 | 100 |

Table 12 presents the best models for the testing results. With the features selected by Relief-F, two models achieved the best testing performance: DT and GBT Classifier. With the full features, two models achieved the best testing performance: RF and SVM. With the features selected by chi-square, however, only one model achieved the best testing performance: SVM.
Table 12

Best models for the testing results (all values in %).

| Best model | Features | AC | PR | RE | FS |
|---|---|---|---|---|---|
| SVM | Full features | 100 | 100 | 100 | 100 |
| RF | Full features | 100 | 100 | 100 | 100 |
| SVM | Features selected by chi-square | 100 | 100 | 100 | 100 |
| DT | Features selected by Relief-F | 100 | 100 | 100 | 100 |
| GBT Classifier | Features selected by Relief-F | 100 | 100 | 100 | 100 |

The results showed that the SVM, DT, and GBT classifiers with the selected features achieved the best performance. Overall, performance with Relief-F feature selection was better than with chi-square feature selection or with the full features. Table 13 compares the performance of previous studies with our work on the same dataset. In our work, Relief-F feature selection achieved the best testing and cross-validation results using DT and GBT Classifier compared with the existing works [23, 24, 26, 27, 30]. Our work also differs from the existing works [22, 25] in that it reports results for both the training set and the testing set while achieving the best performance.
Table 13

The comparison of performance between the previous studies and our work on the same dataset.

| REF | Feature selection methods | Best model | Dataset | Result |
|---|---|---|---|---|
| [22] | RFE | RF | CKD dataset | AC = 100%, PR = 100%, RE = 100%, FS = 100% |
| [27] | No | A hybrid model of LR and RF | CKD dataset | AC = 99.94%, E = 99.84%, S = 99.80% |
| [30] | CFS | AdaBoost based on KNN | CKD dataset | AC = 98.1%, PR = 98%, RE = 98%, FS = 98% |
| [23] | RF-FS, FS, FES, BS, BES | RF | CKD dataset | AC = 98.825%, RE = 98.04% |
| [24] | Cost-sensitive ensemble feature ranking | An ensemble of decision tree models | CKD dataset | AC = 97.27%, PR = 99.44%, RE = 96.25%, FS = 97.68% |
| [25] | No | Random subspace-based KNN | CKD dataset | AC = 100%, RE = 100% |
| [26] | Genetic search algorithm | Multilayer perceptron | CKD dataset | AC = 99.75% |
| Our work | Relief-F | DT | CKD dataset | Cross-validation: AC = 100%, PR = 100%, RE = 100%, FS = 100%; testing: AC = 100%, PR = 100%, RE = 100%, FS = 100% |
| Our work | Relief-F | GBT Classifier | CKD dataset | Cross-validation: AC = 100%, PR = 100%, RE = 100%, FS = 100%; testing: AC = 100%, PR = 100%, RE = 100%, FS = 100% |

5. Conclusion

In this paper, hybrid ML techniques that integrate feature selection methods with ML classification algorithms on a big data platform (Apache Spark) were used to predict CKD. Relief-F and chi-squared feature selection techniques were used to select the important features from the dataset. The ML algorithms DT, LR, NB, RF, SVM, and, as an ensemble learning algorithm, GBT Classifier were applied to a benchmark chronic kidney disease dataset, using both the full features and the selected features. Grid search with cross-validation was used to optimize the parameters of the ML algorithms. In addition, four evaluation measures (accuracy, precision, recall, and F1-measure) were applied to validate the results, and both the cross-validation results and the testing results were recorded. The results showed that the SVM, DT, and GBT classifiers with the selected features achieved the best performance. Overall, the performance of Relief-F feature selection was better than that of chi-square feature selection and the full features.
  6 in total

1.  Disparities in Chronic Kidney Disease Prevalence among Males and Females in 195 Countries: Analysis of the Global Burden of Disease 2016 Study.

Authors:  Boris Bikbov; Norberto Perico; Giuseppe Remuzzi
Journal:  Nephron       Date:  2018-05-23       Impact factor: 2.847

2.  KDIGO clinical practice guideline for the care of kidney transplant recipients.

Authors: 
Journal:  Am J Transplant       Date:  2009-11       Impact factor: 8.086

3.  A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification.

Authors:  Haotian Shi; Haoren Wang; Yixiang Huang; Liqun Zhao; Chengjin Qin; Chengliang Liu
Journal:  Comput Methods Programs Biomed       Date:  2019-02-20       Impact factor: 5.428

4.  Big Data and Machine Learning in Health Care.

Authors:  Andrew L Beam; Isaac S Kohane
Journal:  JAMA       Date:  2018-04-03       Impact factor: 56.272

5.  Neural network and support vector machine for the prediction of chronic kidney disease: A comparative study.

Authors:  Njoud Abdullah Almansour; Hajra Fahim Syed; Nuha Radwan Khayat; Rawan Kanaan Altheeb; Renad Emad Juri; Jamal Alhiyafi; Saleh Alrashed; Sunday O Olatunji
Journal:  Comput Biol Med       Date:  2019-04-25       Impact factor: 4.589

6.  Diagnosis of Chronic Kidney Disease Using Effective Classification Algorithms and Recursive Feature Elimination Techniques.

Authors:  Ebrahime Mohammed Senan; Mosleh Hmoud Al-Adhaileh; Fawaz Waselallah Alsaade; Theyazn H H Aldhyani; Ahmed Abdullah Alqarni; Nizar Alsharif; M Irfan Uddin; Ahmed H Alahmadi; Mukti E Jadhav; Mohammed Y Alzahrani
Journal:  J Healthc Eng       Date:  2021-06-09       Impact factor: 2.682
