Meghana Padmanabhan, Pengyu Yuan, Govind Chada, Hien Van Nguyen.
Abstract
Machine learning is often perceived as a sophisticated technology accessible only to highly trained experts. This perception prevents many physicians and biologists from using the tool in their research. The goal of this paper is to dispel this outdated perception. We argue that the recent development of auto machine learning techniques enables biomedical researchers to quickly build competitive machine learning classifiers without requiring in-depth knowledge of the underlying algorithms. We study the case of predicting the risk of cardiovascular diseases. To support our claim, we compare auto machine learning techniques against a graduate student using several important metrics, including the total amount of time required to build machine learning models and the final classification accuracies on unseen test datasets. In particular, the graduate student manually builds multiple machine learning classifiers and tunes their parameters for one month using scikit-learn, a popular machine learning library, to obtain the ones that perform best on two given, publicly available datasets. We run an auto machine learning library called auto-sklearn on the same datasets. Our experiments find that automatic machine learning takes 1 h to produce classifiers that perform better than the ones built by the graduate student in one month. More importantly, building this classifier requires only a few lines of standard code. Our findings are expected to change the way physicians see machine learning and encourage wide adoption of Artificial Intelligence (AI) techniques in clinical domains.
Keywords: artificial intelligence; auto machine learning; cardiovascular disease prediction; clinical domain; physician-friendly machine learning
Year: 2019 PMID: 31323843 PMCID: PMC6678298 DOI: 10.3390/jcm8071050
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.241
Figure 1. The Auto-Sklearn pipeline [12] contains three main building blocks: (a) Data preprocessor, (b) Feature preprocessor, and (c) Estimator or machine learning algorithms.
Figure 2. Python code for using Auto-Sklearn to train a classifier for any dataset.
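The code in the figure is not reproduced in this record. As a rough illustration of what such a script can look like, the sketch below uses auto-sklearn's `AutoSklearnClassifier` with a fixed time budget. The 1800 s budget mirrors the 0.5 h total time listed in the comparison tables; the per-run limit, the function name `train_and_score`, and the choice of accuracy as the score are assumptions made for illustration, not the authors' code.

```python
# Time budget passed to auto-sklearn, in seconds. 1800 s = 0.5 h;
# the per-run limit is an illustrative assumption.
BUDGET = {"time_left_for_this_task": 1800, "per_run_time_limit": 180}


def train_and_score(X_train, y_train, X_test, y_test, budget=BUDGET):
    """Fit an auto-sklearn classifier and return its test accuracy."""
    # Imported inside the function so the module loads without auto-sklearn.
    from autosklearn.classification import AutoSklearnClassifier
    from sklearn.metrics import accuracy_score

    clf = AutoSklearnClassifier(**budget)
    clf.fit(X_train, y_train)  # searches over preprocessors and estimators
    return accuracy_score(y_test, clf.predict(X_test))
```

Calling `train_and_score(X_train, y_train, X_test, y_test)` on feature and label arrays from either dataset would run the search for the configured budget and return a single accuracy value.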
Thirteen attributes of the Heart UCI (University of California, Irvine, CA, USA) dataset.
| Attribute | Type | Description |
|---|---|---|
| Age | Continuous | Age in years |
| Cp | Discrete | Chest pain type (4 values) |
| Trestbps | Continuous | Resting blood pressure (in mm Hg on admission to the hospital) |
| Chol | Continuous | Serum cholesterol in mg/dL |
| Fbs | Discrete | Fasting blood sugar > 120 mg/dL (1 = true; 0 = false) |
| Restecg | Discrete | Resting electrocardiographic results (values 0,1,2) |
| Thalach | Continuous | Maximum heart rate achieved |
| Exang | Discrete | Exercise induced angina (1 = yes; 0 = no) |
| Oldpeak | Continuous | ST depression induced by exercise relative to rest |
| Slope | Discrete | The slope of the peak exercise ST segment (values 0,1,2) |
| Ca | Discrete | Number of major vessels (0–4) colored by fluoroscopy |
| Thal | Discrete | Nature of defect, values (0–3) |
| Target | Discrete | Presence or absence of heart disease, values (1,0) |
Twelve attributes of Cardiovascular Diseases dataset.
| Attribute | Type | Description |
|---|---|---|
| Age | Continuous | Age of the patient in days |
| Gender | Discrete | 1: women, 2: men |
| Height (cm) | Continuous | Height of the patient in cm |
| Weight (kg) | Continuous | Weight of the patient in kg |
| Ap_hi | Continuous | Systolic blood pressure |
| Ap_lo | Continuous | Diastolic blood pressure |
| Cholesterol | Discrete | 1: normal, 2: above normal, 3: well above normal |
| Gluc | Discrete | 1: normal, 2: above normal, 3: well above normal |
| Smoke | Discrete | whether patient smokes or not |
| Alco | Discrete | Alcohol intake-Binary feature |
| Active | Discrete | Physical activity-Binary feature |
| Cardio | Discrete | Presence or absence of cardiovascular disease |
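Two encodings in this table are easy to misread: age is stored in days, and gender is stored as an integer code. The sketch below decodes both for a single record. The dictionary keys `age` and `gender`, the helper name `decode_record`, and the 365.25-day year are assumptions; the sample record is made up for illustration and is not taken from the dataset.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years
GENDER_LABELS = {1: "women", 2: "men"}  # coding as given in the table


def decode_record(raw):
    """Return a human-readable view of the age and gender fields."""
    return {
        "age_years": raw["age"] / DAYS_PER_YEAR,
        "gender": GENDER_LABELS[raw["gender"]],
    }


# Hypothetical record, not real patient data.
example = decode_record({"age": 18262.5, "gender": 2})
print(example)
```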
Figure 3. Validation accuracy over 18 days by the graduate student on the Heart UCI dataset.
Figure 4. Validation accuracy over 15 days by the graduate student on the Cardiovascular Disease dataset.
Comparison of AutoML and the graduate student’s classification performance and total time on the Heart UCI test set.
| Method | Accuracy | AUC-ROC | AUC-PR | Total time (h) |
|---|---|---|---|---|
| Graduate student | 0.84 | 0.82 | 0.80 | 432 |
| AutoML | 0.85 | 0.93 | 0.94 | 0.5 |
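This record does not say how the metrics were computed. For readers unfamiliar with them, the sketch below shows the standard definitions of accuracy and AUC-ROC in plain Python, with AUC-ROC computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The function names and the toy labels and scores are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Count a win as 1, a tie as 0.5, a loss as 0 over all pairs.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not drawn from either dataset.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
preds = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

print(accuracy(labels, preds), auc_roc(labels, scores))
```

Accuracy depends on the chosen threshold, while AUC-ROC depends only on how the scores rank the examples, which is one reason two classifiers with similar accuracy can differ noticeably in AUC-ROC.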
Comparison of AutoML and the graduate student’s classification performance and total time on the Cardiovascular test set.
| Method | Accuracy | AUC-ROC | AUC-PR | Total time (h) |
|---|---|---|---|---|
| Graduate student | 0.74 | 0.73 | 0.68 | 360 |
| AutoML | 0.74 | 0.80 | 0.79 | 0.5 |
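The total times of 432 h and 360 h are consistent with the 18-day and 15-day periods in Figures 3 and 4 counted at 24 hours per day. That reading is an inference, since the conversion is not stated in this record. Under that assumption, the arithmetic and the resulting time ratios against the 0.5 h AutoML runs can be checked as follows.

```python
HOURS_PER_DAY = 24  # assumption: days are counted as full 24-hour days
AUTOML_HOURS = 0.5  # total time listed for AutoML on each dataset


def days_to_hours(days):
    """Convert a number of days to hours."""
    return days * HOURS_PER_DAY


student_hours = {
    "Heart UCI": days_to_hours(18),       # Figure 3: 18 days
    "Cardiovascular": days_to_hours(15),  # Figure 4: 15 days
}

# How many times longer the manual effort took than the AutoML run.
time_ratio = {name: hours / AUTOML_HOURS
              for name, hours in student_hours.items()}

print(student_hours, time_ratio)
```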
Accuracies reported by previous studies on the Heart UCI Dataset compared to accuracies of the graduate student and AutoML.
| Author | Reported Accuracy |
|---|---|
| Shouman et al. | 0.841 |
| Duch et al. | 0.856 |
| Wang et al. | 0.8337 |
| Srinivas et al. | 0.837 |
| Tomar and Agarwal | 0.8559 |
| Graduate student (this paper) | 0.84 |
| AutoML (this paper) | 0.85 |