Prediction of Multiple sclerosis disease using machine learning classifiers: a comparative study.

Sonia Darvishi¹, Omid Hamidi², Jalal Poorolajal³.

Abstract

INTRODUCTION: Hamadan Province is one of Iran's high-risk regions for Multiple Sclerosis (MS). Early diagnosis of MS with an accurate system can help control the disease. The aim of this study was to compare the performance of four machine learning techniques with that of two traditional methods for predicting MS.
METHODS: The study used data on 200 participants from a case-control study conducted in Hamadan, western Iran, from 2013 to 2015. Six classifiers were compared in terms of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR-) and total accuracy.
RESULTS: The Random Forest (RF) model performed better than the other models in both scenarios. It had the highest specificity (0.67), PPV (0.68) and total accuracy (0.68). The most influential diagnostic factors for MS were age, birth season and gender.
CONCLUSIONS: Although all six methods performed similarly, the RF model performed slightly better on several prediction criteria. Accordingly, this approach is an effective classifier for predicting MS at an early stage and helping to control the disease. ©2021 Pacini Editore SRL, Pisa, Italy.

Keywords: Classification; Machine learning; Multiple sclerosis

Year: 2021 PMID: 34322636 PMCID: PMC8283630 DOI: 10.15167/2421-4248/jpmh2021.62.1.1651

Source DB: PubMed Journal: J Prev Med Hyg ISSN: 1121-2233

Introduction

Multiple sclerosis (MS) is a chronic autoimmune inflammatory disease of the central nervous system (the brain and spinal cord) without a clear etiology. Focal lymphocytic infiltration in MS leads to the destruction of myelin and axons [1]. The disease usually begins in young adults, with people in their 20s and 30s being the most susceptible [2, 3]. Approximately 2.5 million people worldwide are affected by MS [4]. In Iran, the prevalence of MS varies geographically between 5.3 and 74.28 per 100,000 [2]. Epidemiological studies have shown that MS is becoming more common, especially in women [2, 5, 6]. Hamadan Province, located in western Iran, is among the highest-risk regions in the country, with a prevalence of 62.5 per 100,000 [2]. Both the prevalence of MS and its burden on public health systems have increased over the past years [7]. Therefore, identifying the most important factors related to MS is of great importance. Many epidemiologic studies have shown that MS has a multifactorial etiology in which several environmental factors act on people with complex genetic risk profiles [8]. These genetic and non-genetic factors include dietary patterns [9], infectious agents [9], familial clustering [10], season of birth [11], age at infection during childhood [11, 12], smoking [13], environmental exposures [14], and psychological stress [15]. Early detection can play a critical role in improving survival in MS by increasing the proportion of patients diagnosed at early stages [16]. To this end, traditional classification techniques, including logistic regression, have been widely used in medical problems to distinguish cases from controls. Although these models are simple to interpret, they usually cannot account for complex relationships between variables. 
Hence, there is a clear need for newer models with the lowest possible prediction error, and a precise and reliable system is required for early diagnosis. Most modern medical diagnostic tools are built on classification, and many researchers have adapted them to improve precision. Recently, machine learning techniques have become very popular and have been widely used in several research areas, including medicine, especially for classification problems [17, 18]. These methods learn from experience to improve their performance and can help physicians diagnose new patients by increasing sensitivity and supporting decision-making [19]. Although the main objective of these models is to identify effective variables and their relationships, they can also be used to predict and estimate effects [20, 21]. Various machine learning methods have been introduced in different studies [22-24]. Examples include Naive Bayes (NB), Decision Trees, Random Forest (RF), Nearest Neighbor, AdaBoost, Support Vector Machine (SVM), RBF Network, and Multilayer Perceptron, which have been used to predict different outcomes [25-27]. Although several studies have shown that data mining techniques outperform traditional techniques in terms of higher accuracy and lower error rates, this advantage does not hold in all data sets [28], and findings are inconsistent across studies. Comparing the performance of different methods on different data sets is therefore of great importance. The present study aimed to comprehensively compare four machine learning techniques (NB, Least Square Support Vector Machine (LSSVM), SVM and RF) and two traditional methods (Logistic Regression (LR) and Linear Discriminant Analysis (LDA) [29]) in distinguishing people with MS from healthy people in Iran.

Methods

DATA SOURCE

This study was approved by the Research Council of the University of Medical Sciences of Hamadan (ID: 9204181211). The data were collected through a case-control study in Hamadan Province, western Iran, from September 2013 to March 2014. Participants entered the study voluntarily. Because the study involved no intervention, only verbal informed consent was obtained. According to Asadollahi et al. [30], 80% of patients with MS and 60% of controls were female. Based on these proportions, a sample size of 100 per group (200 in total) was calculated at a 95% confidence level and 80% statistical power. Accordingly, 100 patients with definite MS were enrolled as the case group and compared with 100 patients with infectious diseases, who had no history of neurological disorder, as the control group. To make the study groups comparable, cases and controls were enrolled during the same period and in the same hospital. Cases and controls were selected from patients referred to the neurology clinic and the infectious diseases clinic of Farshchian Hospital, respectively. Farshchian Hospital, where the study was conducted, is a referral center that receives patients from different cities of the province. To give the case and control groups a similar study base, controls were selected from the Infectious Diseases Ward, which was next to the Neurology Ward. Furthermore, the clinical and laboratory information of the control group was accessible from their medical records. Cases were selected regardless of age, gender, and date of disease onset. In this study, a case was defined as a patient whose MS was diagnosed by a neurologist using brain MRI or total spinal MRI. 
Patients who met the following criteria were entered into the study: 1) diagnosed during the past 10 years; 2) resident of Hamadan Province; 3) under treatment, with a complete medical record at Farshchian Hospital. Patient consent and availability were also required for entry. A control was defined as a patient with an infectious disease, without a neurological disorder, who was seeking medical care. Patients with infectious diseases who came from other jurisdictions were excluded. A standardized 40-item questionnaire was designed to collect data on socio-demographic characteristics and environmental factors. It covered gender, age at diagnosis, occupation, marital status, educational level, weight, height, history of smoking, exclusive breastfeeding, history of measles, family history of MS, birth season, history of immune system disease, blood group, and Rh factor. The Body Mass Index (BMI), the ratio of body weight in kilograms to the square of height in meters, was classified into three categories: underweight (BMI < 18.5), normal weight (BMI = 18.5-24.9), and overweight or obese (BMI ≥ 25). The Friedman-Rosenman standard questionnaire was used to assess the participants' personality type. It consisted of 25 two-choice (yes/no) questions, for a total score of 25. Scores of ≥ 13 and < 13 were classified as type A and type B personality, respectively [31, 32]. The reliability of the personality questionnaire, assessed by Cronbach's alpha coefficient, was 0.77. Data were collected through face-to-face interviews.
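The two categorical recodings described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, treating BMI values between 24.9 and 25 as normal weight is an assumption, and so is scoring the personality questionnaire as one point per "yes" answer.

```python
def bmi_category(weight_kg: float, height_m: float) -> str:
    """Classify BMI (kg / m^2) into the three groups used in the study."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:  # assumption: values between 24.9 and 25 count as normal weight
        return "normal weight"
    return "overweight or obese"


def personality_type(score: int) -> str:
    """Type A for a score of 13 or more out of 25, otherwise type B.

    Assumption: the score is the number of "yes" answers (0 to 25).
    """
    if not 0 <= score <= 25:
        raise ValueError("score must be between 0 and 25")
    return "A" if score >= 13 else "B"
```

For example, a participant weighing 70 kg at 1.75 m has a BMI of about 22.9 and falls in the normal-weight group.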

DATA MINING ALGORITHMS

Naive Bayes (NB)

This classification method is based on Bayes' theorem and is straightforward, simple and quick [33, 34]. Once the data have been split into training and test sets, the prior probability of each class and the conditional probability of each independent variable Xi given the class label C of the output variable are estimated from the training set. By Bayes' theorem, and assuming the independent variables are conditionally independent given the class, the posterior probability of C is proportional to the product of the class prior and these conditional probabilities. A new observation is assigned to the class with the highest posterior probability [25].
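The classification rule described in this paragraph is usually written as follows. This is the standard textbook form of the Naive Bayes rule, given here as a sketch; the number of predictors p and the class value c are notation introduced for illustration.

```latex
P(C = c \mid X_1, \dots, X_p) \;\propto\; P(C = c) \prod_{i=1}^{p} P(X_i \mid C = c),
\qquad
\hat{c} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{p} P(X_i \mid C = c)
```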

Support vector machine (SVM)

SVM is a flexible, well-known method that can be used for classification or regression. For classification, a nonlinear kernel function maps the independent variables into a high-dimensional space in which the cases can be separated well. With the Radial Basis Function (RBF) kernel, the cost parameter sets the trade-off between misclassification of training samples and simplicity of the decision surface. The classes of the outcome variable are separated by the maximum-margin hyperplane. The generalization error is minimized by maximizing the distance between two parallel hyperplanes on either side of the separating hyperplane [35].

Least Square Support Vector Machine (LSSVM)

The LS-SVM is a modification of the SVM that uses a least-squares loss function and equality constraints, so that the dual solution is found by solving a linear system rather than a quadratic programming problem. Like the SVM, the LS-SVM maps the data into a high-dimensional space. In its primal formulation, the LS-SVM classifier minimizes a regularized sum of squared errors subject to equality constraints [36].
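For reference, the primal LS-SVM classification problem is commonly stated as below. This is the standard formulation from the LS-SVM literature rather than a formula recovered from this article; the symbols w (weights), b (bias), e_i (errors), γ (regularization constant), φ (feature map) and N (number of training cases) are notation assumed here.

```latex
\min_{w,\, b,\, e} \; J(w, e) = \frac{1}{2} w^{\top} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2}
\quad \text{subject to} \quad
y_i \left[ w^{\top} \varphi(x_i) + b \right] = 1 - e_i, \qquad i = 1, \dots, N
```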

Random Forest (RF)

The RF method was introduced by Leo Breiman [37] as an ensemble of classification and regression trees. Each tree is grown on a bootstrap sample of the original dataset (sampling with replacement). At each node, the split is chosen from a random subset of the predictors. The most influential predictors can be identified using the mean decrease in Gini and the mean decrease in accuracy [37].
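To make the Gini-based importance concrete, the following minimal sketch computes the Gini impurity of a node and the decrease obtained by one split. In a random forest these decreases are accumulated per variable over all splits and averaged over trees; that aggregation is omitted here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[str]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent: Sequence[str], left: Sequence[str], right: Sequence[str]) -> float:
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children
```

A split that separates the two classes perfectly yields the largest possible decrease for that node, while a split that leaves both children with the parent's class mix yields a decrease of zero.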

Logistic Regression (LR)

This method assumes that the binary outcome follows a binomial distribution. The model expresses the log-odds of the outcome as a linear function of the covariates X, where the regression coefficients bi measure the effect size of each covariate [38].
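Written out, the logistic regression model takes the usual form below. This is the standard expression, shown as a sketch; the intercept b_0, the number of covariates p and the outcome Y are notation assumed here.

```latex
\log \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = b_0 + \sum_{i=1}^{p} b_i X_i
```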

Linear Discriminant Analysis (LDA) [29]

LDA is similar to LR and uses a linear combination of predictors to give a clear interpretation of the dependent variable. LDA approaches the problem through the conditional distribution of the predictors given the output class. The method maximizes the dispersion between cases of different classes and minimizes it among cases of the same class [39].

EVALUATION CRITERIA AND CROSS VALIDATION

To compare the discriminative power of the classification methods, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR-) and total accuracy were calculated from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP denotes patients with MS who were correctly identified as having MS, TN denotes healthy controls who were correctly identified as healthy, FP denotes healthy controls who were incorrectly identified as having MS, and FN denotes patients with MS who were incorrectly identified as healthy. Variable importance in the resulting RF was ranked by mean decrease in Gini, which shows how much each variable contributes to the homogeneity of the nodes and leaves [37, 40]. Each time a variable is used to split a node, the Gini coefficient of the child nodes is calculated and compared with that of the original node. Furthermore, the partial dependence plot (PDP) shows how the estimated prediction function depends on each explanatory variable. The analysis was conducted using RStudio software v 3.6.2.
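The evaluation criteria follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name is hypothetical, and no guard against zero denominators is included, so it assumes every count combination in the denominators is non-zero.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # proportion of MS cases correctly identified
    specificity = tn / (tn + fp)  # proportion of controls correctly identified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_plus": sensitivity / (1 - specificity),
        "lr_minus": (1 - sensitivity) / specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

With illustrative counts of TP = 21, FP = 10, TN = 20 and FN = 9 (invented numbers, not study data), sensitivity is 21/30 and specificity is 20/30.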

Results

DATA DESCRIPTION

The data set included 200 records: 100 patients with definite MS as the case group and 100 patients with infectious diseases, who had no history of neurological disorder, as the control group. Table I displays the demographic and clinical characteristics of the participants. Females accounted for 80% of cases and 48% of controls (P < 0.001). The mean (SD) age of the control group was higher than that of the case group: 41.2 (14.8) years vs 36.1 (11.5) years. Most participants were married and had no academic degree. The proportion of smokers was significantly higher in controls than in cases (27% vs 6%; P < 0.001). The number of widows and divorcees was lower in cases than in controls, but the difference was not statistically significant (P = 0.074). The proportion of exclusive breastfeeding differed significantly between cases and controls (78% vs 94%; P < 0.01). A history of measles was less frequent in cases than in controls (36% vs 50%; P < 0.05).

Tab. I.

Demographic and clinical characteristics of the case and control groups.

Variable	Cases (%)	Controls (%)	P-value
Gender
Male	20(20)	52(52)	< 0.001
Female	80(80)	48(48)
Marital status
Single	25(25)	19(19)	0.3
Married	75(75)	81(81)
Educational level
Non-academic	63(63)	72(72)	0.1
Academic	37(37)	28(28)
Positive family history
No	90(90)	94(94)	0.2
Yes	10(10)	6(6)
Smoking status
Non-smoker	94(94)	73(73)	< 0.001
Smoker	6(6)	27(27)
Exclusive breast feeding
Non-breast feeding	22(22)	6(6)	0.001
Breast feeding	78(78)	94(94)
History of measles
No	64(64)	50(50)	0.04
Yes	36(36)	50(50)
Season of birth
Spring	23(23)	33(33)	0.08
Summer	27(27)	31(31)
Autumn	29(29)	15(15)
Winter	21(21)	21(21)
Blood group
AB	5(5)	14(14)	0.3
A	21(21)	22(22)
B	20(20)	24(24)
O	25(25)	40(40)
Blood Rh
Negative	13(13)	24(24)	0.2
Positive	60(60)	73(73)
BMI
Underweight	2(2)	4(4)	0.6
Normal weight	32(32)	33(33)
Overweight & obesity	32(32)	43(43)
Type of personality
B	30(30)	40(40)	0.1
A	70(70)	60(60)

PERFORMANCE OF THE MODELS

To avoid overfitting, we evaluated the models under two scenarios: 70% training and 30% testing, and 50% training and 50% testing. In each scenario the random split was repeated 100 times, and the evaluation criteria were averaged over the 100 repetitions. Table II compares the sensitivity, specificity, PPV, NPV, total accuracy, LR+ and LR- of the classification methods in the two scenarios. In both scenarios, all methods performed quite similarly in terms of LR+ and LR-. RF achieved the highest accuracy in both scenarios, matched by SVM in the 70/30 scenario.
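The repeated random-split procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: the splits are simple (unstratified) random permutations, and `repeated_holdout` and the `evaluate` callback are hypothetical names. The callback is expected to fit a model on the training indices and return one metric computed on the test indices.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def repeated_holdout(
    n_samples: int,
    train_fraction: float,
    evaluate: Callable[[Sequence[int], Sequence[int]], float],
    repeats: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the mean and standard deviation of a metric over random splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    scores: List[float] = []
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random permutation for every repetition
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        scores.append(evaluate(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling it with `n_samples=200` and `train_fraction=0.7` or `0.5` mirrors the two scenarios, and the returned pair corresponds to the "Mean" and "Std.dv" columns reported for each criterion.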

Tab. II.

Mean and standard deviation of sensitivity, specificity, PPV, NPV, total accuracy, positive LR and negative LR for various models.

Scenario	Models	Sensitivity		Specificity		PPV		NPV		TA		LR+		LR-
		Mean	Std.dv	Mean	Std.dv	Mean	Std.dv	Mean	Std.dv	Mean	Std.dv	Mean	Std.dv	Mean	Std.dv
70, 30	NB	0.79	0.10	0.55	0.13	0.64	0.08	0.74	0.10	0.67	0.05	1.92	0.54	0.35	0.14
	LSSVM	0.61	0.09	0.67	0.09	0.65	0.08	0.64	0.07	0.64	0.05	2.06	0.86	0.51	0.01
	RF	0.71	0.08	0.67	0.08	0.68	0.07	0.70	0.08	0.68	0.04	2.06	0.61	0.51	0.11
	SVM	0.72	0.89	0.64	0.1	0.67	0.08	0.70	0.09	0.68	0.05	2.06	0.65	0.51	0.13
	LR	0.67	0.08	0.65	0.1	0.66	0.08	0.66	0.09	0.66	0.06	2.06	0.64	0.51	0.15
	LDA	0.68	0.08	0.64	0.10	0.66	0.08	0.66	0.09	0.55	0.06	2.06	0.66	0.51	0.16
50, 50	NB	0.77	0.15	0.56	0.15	0.64	0.07	0.74	0.10	0.66	0.05	1.90	0.45	0.37	0.17
	LSSVM	0.62	0.10	0.63	0.10	0.63	0.07	0.63	0.06	0.63	0.04	1.90	0.46	0.51	0.13
	RF	0.71	0.09	0.57	0.09	0.68	0.07	0.70	0.07	0.68	0.04	1.90	0.43	0.51	0.11
	SVM	0.69	0.10	0.63	0.1	0.65	0.06	0.68	0.08	0.66	0.04	1.90	0.42	0.51	0.13
	LR	0.67	0.09	0.63	0.08	0.65	0.06	0.66	0.07	0.66	0.04	1.90	0.65	0.51	0.11
	LDA	0.68	0.09	0.63	0.09	0.64	0.06	0.66	0.07	0.65	0.04	1.90	0.50	0.51	0.12

Because all methods performed similarly in classifying MS patients and controls, we calculated variable importance to rank the contribution of each variable to predicting MS. In Figure 1, the points represent the mean decrease in Gini, which indicates the importance of each variable in the RF model. Age was the top-ranked variable in predicting MS, followed by season of birth and sex. Using a threshold of 10 for the mean decrease in Gini, we selected these three variables as the most important.

Fig. 1.

Variable importance in predicting MS disease using RF model.

Moreover, partial dependence plots (PDPs) were computed to visualize the relationship between the prediction of MS and individual features in the RF model. According to Figure 2, MS was predicted for females, married participants, and those with non-academic education, a history of measles, birth in spring, a history of smoking and type B personality.

Fig. 2.

Partial plots for variables in predicting MS using RF.

Discussion

The present study aimed to comprehensively compare four machine learning techniques (NB, LSSVM, SVM and RF) and two traditional methods (LR and LDA) for distinguishing people with MS from healthy people in Iran. The performance criteria were very similar across all six classifiers, even though they are derived from different algorithmic approaches. Based on total accuracy, in both scenarios (1: 70% training and 30% testing; 2: 50% training and 50% testing) all classification methods performed almost the same in classifying MS cases and controls (range: 0.55 to 0.68). Only one of the six classifiers showed a total accuracy below 0.6 (LDA, with a total accuracy of 0.55). In other words, all the classification methods provided similar accuracy in predicting the classes of the case and control groups. However, the total accuracy of the RF model was slightly higher than that of the others in both scenarios (0.68). In the 70/30 scenario, sensitivity ranged from 0.61 for LSSVM to 0.79 for NB. NB also had the highest sensitivity in the 50/50 scenario (0.77). For specificity, however, the RF model performed better than the other models (0.67), whereas the NB model performed poorly (0.55). The same was true for PPV. In other words, RF was the best model according to specificity and PPV, while the maximum sensitivity and NPV belonged to NB. The RF model outperformed the other models on the remaining indices and was more effective than NB, LSSVM, SVM, LR and LDA. Moreover, RF and NB showed similar accuracies. Since these are the most commonly used algorithms in practice [29, 30, 41, 42], the RF model was used for additional analysis. Our findings identified age as the most important factor in predicting MS. This result is consistent with previous findings that MS is most likely to occur in the 20-40 age group [1, 7, 43]. 
Our analysis indicates that patients in their late 20s to mid-30s were at high risk of MS. The PDP showed that the predicted probability of MS is low until age 50 and increases thereafter. This finding was inconsistent with the results of previous studies [44]. Season of birth was the second most important variable in predicting MS, consistent with previous findings [11]. The PDP indicated that the risk of MS was higher in patients born in spring and summer. Cruz et al. in the United Kingdom also found that spring-born patients are at greater risk than autumn-born patients [45]. Walleczek et al. likewise found a significant rise in MS births in April and a decline in November [46]. On the other hand, our analysis contradicts some previous reports that autumn-born patients had a higher risk of MS than spring-born patients [1, 47]. In 2019, a systematic review and multivariate meta-analysis was conducted to address this conflict and revealed that, in the northern hemisphere, the effect of birth season was related to latitude, annual dry bulb temperature and sunshine duration. For populations at latitudes > 52°, this effect was restricted to sunshine duration [48]. The third factor influencing the prediction was gender. According to the PDP, MS is more likely to be diagnosed in females than in males. This finding agrees with previous research [3, 49, 50]. It may be due to differences between women and men in immune status, the nervous system, and lifestyle [3]. One of the major changes in the life of contemporary women is the tendency to have fewer children, and to have them later in life, than their grandmothers. Because of temporary immunosuppression during pregnancy [51], pregnancy may protect women against MS, and a higher age at first childbirth or fewer pregnancies may contribute to the increased incidence of MS in women [51]. 
Our study had several limitations. First, we did not use quantitative MRI features to build the models; future work will incorporate additional biomarker data. Second, the sample size and the matching of cases and controls by age and sex were limited.

Conclusions

The aim of this research was to evaluate the performance of four machine learning and two classical techniques in predicting MS. Our findings suggest that, in this study, RF was the best model for distinguishing between the two groups of patients in terms of multiple criteria.

34 in total

Review 1. Aging and multiple sclerosis.

Authors: Shaik Ahmed Sanai; Vasu Saini; Ralph Hb Benedict; Robert Zivadinov; Barbara E Teter; Murali Ramanathan; Bianca Weinstock-Guttman
Journal: Mult Scler Date: 2016-02-19 Impact factor: 6.312

2. Incidence and prevalence of multiple sclerosis in southeastern Iran.

Authors: Ali Moghtaderi; Forough Rakhshanizadeh; Shahryar Shahraki-Ibrahimi
Journal: Clin Neurol Neurosurg Date: 2012-06-18 Impact factor: 1.876

3. Making sense of illness or disability: the nature of sense making in multiple sclerosis (MS).

Authors: Kenneth I Pakenham
Journal: J Health Psychol Date: 2008-01

4. Overview of the epidemiology, diagnosis, and disease progression associated with multiple sclerosis.

Authors: Mark J Tullman
Journal: Am J Manag Care Date: 2013-02 Impact factor: 2.229

5. Multiple sclerosis in Isfahan, Iran: Past, Present and Future.

Authors: Masoud Etemadifar; Seyed-Hossein Abtahi
Journal: Int J Prev Med Date: 2012-05

6. Using the random forest method to detect a response shift in the quality of life of multiple sclerosis patients: a cohort study.

Authors: Mohamed Boucekine; Anderson Loundou; Karine Baumstarck; Patricia Minaya-Flores; Jean Pelletier; Badih Ghattas; Pascal Auquier
Journal: BMC Med Res Methodol Date: 2013-02-15 Impact factor: 4.615

7. Smoking and multiple sclerosis: an updated meta-analysis.

Authors: Adam E Handel; Alexander J Williamson; Giulio Disanto; Ruth Dobson; Gavin Giovannoni; Sreeram V Ramagopalan
Journal: PLoS One Date: 2011-01-13 Impact factor: 3.240

8. Regional variation in the incidence rate and sex ratio of multiple sclerosis in Scotland 2010-2017: findings from the Scottish Multiple Sclerosis Register.

Authors: Patrick K A Kearns; Martin Paton; Martin O'Neill; Chrissie Waters; Shuna Colville; James McDonald; Ian J B Young; Dan Pugh; Jonathon O'Riordan; Belinda Weller; Niall MacDougall; Tom Clemens; Chris Dibben; James F Wilson; Marcia C Castro; Alberto Ascherio; Siddharthan Chandran; Peter Connick
Journal: J Neurol Date: 2019-06-11 Impact factor: 4.849

9. Study of type a and B behavior patterns in patients with multiple sclerosis in an Iranian population.

Authors: Vahid Shaygannejad; Sedigheh Rezaei Dehnavi; Fereshteh Ashtari; Somayeh Karimi; Leila Dehghani; Rokhsareh Meamar; Zahra Tolou-Ghamari
Journal: Int J Prev Med Date: 2013-05

10. Multiple Sclerosis Associated Risk Factors: A Case-Control Study.

Authors: Jalal Poorolajal; Mehrdokht Mazdeh; Mohammad Saatchi; Elaheh Talebi Ghane; Azam Biderafsh; Bahar Lotfi; Mohammad Feryadres; Khabat Pajohi
Journal: Iran J Public Health Date: 2015-11 Impact factor: 1.429