| Literature DB >> 31702773 |
Alena Orlenko1, Daniel Kofink2, Leo-Pekka Lyytikäinen2, Kjell Nikus3,4, Pashupati Mishra3,4, Pekka Kuukasjärvi5, Pekka J Karhunen6, Mika Kähönen7, Jari O Laurikka5, Terho Lehtimäki3,4, Folkert W Asselbergs2,8,9, Jason H Moore1.
Abstract
MOTIVATION: Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).
Year: 2020 PMID: 31702773 PMCID: PMC7703753 DOI: 10.1093/bioinformatics/btz796
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
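A TPOT run ultimately exports an ordinary scikit-learn pipeline, so the kind of model reported in the tables below can be sketched directly in scikit-learn. The following is an illustrative sketch on synthetic data: the operator stack (scaler, percentile feature selector, Bernoulli Naïve Bayes) is a hypothetical stand-in for a TPOT-discovered pipeline, not the authors' actual exported model.

```python
# Illustrative sketch: a TPOT run exports a plain scikit-learn pipeline.
# The operators below (scaler -> percentile selector -> Bernoulli NB) are
# hypothetical stand-ins, not the authors' exported ANGES pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of the clinical/metabolic profiles
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectPercentile(f_classif, percentile=50)),
    ("clf", BernoulliNB()),
])
pipe.fit(X_tr, y_tr)
score = pipe.score(X_va, y_va)  # accuracy on the held-out split
```

In TPOT itself, such a pipeline would be the result of `TPOTClassifier(...).fit(...)` followed by `export()`; the genetic programming search varies the operators and their hyperparameters rather than fixing them as above.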
Fig. 1. Example of the TPOT pipeline
Comparative analysis of the TPOT optimization of selected models with various performance metrics for P1 (A) and P2 (B).
| Model | Balanced accuracy V/T | Precision V/T | Recall V/T | ROC AUC V/T | PRC V/T | Pipeline complexity |
|---|---|---|---|---|---|---|
| A. P1 | ||||||
| A1. TPOT (BNB) | 0.77/0.79 | 0.91/0.93 | 0.77/0.79 | 0.77/0.86 | 0.88/0.95 | 5 |
| A2. LR TPOT | 0.76/0.79 | 0.90/0.93 | 0.79/0.80 | 0.76/0.86 | 0.87/0.95 | 9 |
| A3. DT TPOT | 0.70/0.74 | 0.88/0.89 | 0.71/0.80 | 0.70/0.74 | 0.85/0.87 | 7 |
| A4. RF TPOT | 0.69/0.69 | 0.85/0.86 | 0.85/0.88 | 0.61/0.81 | 0.81/0.94 | 4 |
| A5. LR GS | 0.68/0.72 | 0.85/0.87 | 0.85/0.90 | 0.68/0.87 | 0.83/0.95 | 1 |
| A6. DT GS | 0.61/0.67 | 0.81/0.83 | 0.86/0.84 | 0.61/0.72 | 0.81/0.87 | 1 |
| A7. RF GS | 0.61/0.66 | 0.81/0.84 | | 0.64/0.81 | 0.82/0.93 | 1 |
| B. P2 | ||||||
| B1. TPOT (BNB) | 0.78/0.78 | 0.82/0.84 | 0.79/0.79 | 0.78/0.82 | 0.78/0.86 | 5 |
| B2. LR TPOT | 0.77/0.75 | 0.80/0.80 | 0.84/0.80 | 0.77/0.84 | 0.76/0.90 | 5 |
| B3. DT TPOT | 0.75/0.76 | 0.78/0.81 | 0.84/0.81 | 0.72/0.80 | 0.73/0.84 | 6 |
| B4. RF TPOT | 0.76/0.76 | 0.78/0.81 | | 0.75/0.83 | 0.74/0.88 | 4 |
| B5. LR GS | 0.73/0.74 | 0.76/0.78 | 0.84/0.84 | 0.73/0.85 | 0.73/0.89 | 1 |
| B6. DT GS | 0.74/0.73 | 0.78/0.80 | 0.81/0.76 | 0.74/0.78 | 0.74/0.83 | 1 |
| B7. RF GS | 0.72/0.72 | 0.77/0.78 | 0.76/0.81 | 0.69/0.83 | 0.70/0.88 | 1 |
: Metric scores are shown for the validation (V) and training (T) sets. The highest score in each metric category is marked in bold. The best model for each phenotypic profile was selected according to the highest balanced accuracy. BNB, Bernoulli Naïve Bayes classifier; LR, logistic regression classifier; DT, decision tree classifier; RF, random forest classifier; GS, grid search optimization.
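The five metrics reported in the table map directly onto standard scikit-learn functions (the PRC column corresponds to average precision, the area under the precision-recall curve). A small sketch on made-up labels and scores, not study data:

```python
# Compute the table's five metrics for a toy prediction.
# Labels and scores below are illustrative, not from the ANGES study.
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.65, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

metrics = {
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),        # ROC AUC column
    "prc": average_precision_score(y_true, y_prob),  # PRC column
}
```

Note that balanced accuracy, precision and recall are computed from thresholded predictions, while ROC AUC and PRC use the continuous scores.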
Fig. 2. ROC AUC curves for selected models for the P1 dataset (A) and the P2 dataset (B)
Fig. 3. Precision-recall curves for selected models for the P1 dataset (A) and the P2 dataset (B)
The complexity-performance relationship for models selected by the TPOT optimization for P1 (A) and P2 (B)
| Model | Balanced accuracy V/T | Precision V/T | Recall V/T | ROC AUC V/T | PRC V/T | Pipeline complexity |
|---|---|---|---|---|---|---|
| A. P1 | ||||||
| Model A1 | 0.77/0.79 | 0.91/0.93 | 0.77/0.79 | 0.77/0.86 | 0.88/0.95 | 5 |
| Pr-1 | 0.74/0.77 | 0.90/0.92 | 0.77/0.79 | 0.74/0.84 | 0.86/0.94 | 4 |
| Pr-2 | 0.67/0.73 | 0.86/0.90 | 0.73/0.72 | 0.67/0.80 | 0.83/0.92 | 3 |
| Pr-3 | 0.64/0.69 | 0.84/0.89 | 0.70/0.68 | 0.64/0.76 | 0.82/0.91 | 2 |
| Pr-4 | 0.62/0.61 | 0.81/0.81 | 0.95/0.96 | 0.62/0.85 | 0.81/0.95 | 1 |
| B. P2 | ||||||
| Model B1 | 0.78/0.78 | 0.82/0.84 | 0.79/0.79 | 0.78/0.82 | 0.78/0.86 | 5 |
| Pr-1 | 0.78/0.76 | 0.82/0.82 | 0.81/0.79 | 0.78/0.81 | 0.78/0.86 | 4 |
| Pr-2 | 0.74/0.76 | 0.79/0.82 | 0.77/0.78 | 0.74/0.81 | 0.74/0.86 | 3 |
| Pr-3 | 0.68/0.75 | 0.75/0.82 | 0.70/0.75 | 0.68/0.80 | 0.70/0.85 | 2 |
| Pr-4 | 0.68/0.75 | 0.75/0.82 | 0.70/0.75 | 0.68/0.80 | 0.70/0.85 | 1 |
: The suffix in ‘Pr-1’, ‘Pr-2’, etc. indicates the number of pre-processors removed from the original model pipeline.
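The Pr-N rows amount to stripping N pre-processing operators from the selected pipeline and refitting what remains. A hedged sketch of that procedure on synthetic data; the operator stack and removal order are illustrative, not the authors' actual pipeline:

```python
# Sketch of the "Pr-N" pruning experiment: drop pre-processors one at a
# time from an (illustrative) pipeline and refit. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Hypothetical 4-step pipeline: 3 pre-processors + final classifier
steps = [("scale", StandardScaler()),
         ("minmax", MinMaxScaler()),
         ("pca", PCA(n_components=10)),
         ("clf", BernoulliNB())]

scores = {}
for n_removed in range(len(steps)):       # 0 = full pipeline, 1 = "Pr-1", ...
    pruned = Pipeline(steps[n_removed:])  # keep the tail of the operator stack
    pruned.fit(X_tr, y_tr)
    scores[f"Pr-{n_removed}"] = pruned.score(X_va, y_va)
```

Pipeline complexity in the table then corresponds to the number of remaining steps; the point of the experiment is to see how much validation performance degrades as complexity drops toward the bare classifier.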
Comparative analysis of the grid search optimization of selected ML algorithms with SS, SP and RFE pre-processing operators for P1 (A) and P2 (B)
| Model | Balanced accuracy V/T | Precision V/T | Recall V/T | ROC AUC V/T | PRC V/T | Pipeline complexity |
|---|---|---|---|---|---|---|
| A. P1 | ||||||
| LR pipelines | ||||||
| LR | 0.68/0.72 | 0.85/0.87 | 0.85/0.90 | 0.68/0.87 | 0.84/0.95 | 1 |
| LR + SS | 0.68/0.77 | 0.86/0.92 | 0.76/0.80 | 0.68/0.84 | 0.84/0.94 | 2 |
| LR + SS + SP | 0.68/0.78 | 0.86/0.92 | 0.76/0.81 | 0.68/0.84 | 0.84/0.94 | 3 |
| LR + SS + RFE | 0.73/0.77 | 0.88/0.91 | 0.80/0.81 | 0.73/0.84 | 0.86/0.94 | 3 |
| BNB pipelines | ||||||
| BNB | 0.72/0.77 | 0.88/0.91 | 0.76/0.78 | 0.72/0.85 | 0.85/0.95 | 1 |
| BNB + SS | 0.66/0.72 | 0.85/0.89 | 0.73/0.73 | 0.66/0.79 | 0.83/0.92 | 2 |
| BNB + SS + SP | 0.71/0.76 | 0.88/0.92 | 0.76/0.75 | 0.71/0.84 | 0.85/0.95 | 3 |
| BNB + SS + RFE | 0.70/0.73 | 0.88/0.90 | 0.75/0.76 | 0.70/0.82 | 0.85/0.94 | 3 |
| B. P2 | ||||||
| LR pipelines | ||||||
| LR | 0.73/0.74 | 0.76/0.78 | 0.84/0.84 | 0.73/0.85 | 0.73/0.89 | 1 |
| LR + SS | 0.69/0.76 | 0.73/0.81 | 0.79/0.80 | 0.69/0.84 | 0.70/0.89 | 2 |
| LR + SS + SP | 0.72/0.76 | 0.76/0.81 | 0.80/0.79 | 0.72/0.83 | 0.73/0.89 | 3 |
| LR + SS + RFE | 0.74/0.74 | 0.78/0.80 | 0.81/0.80 | 0.74/0.84 | 0.74/0.89 | 3 |
| BNB pipelines | ||||||
| BNB | 0.70/0.74 | 0.75/0.80 | 0.77/0.79 | 0.70/0.80 | 0.71/0.85 | 1 |
| BNB + SS | 0.63/0.66 | 0.69/0.75 | 0.69/0.67 | 0.63/0.74 | 0.66/0.81 | 2 |
| BNB + SS + SP | 0.74/0.75 | 0.80/0.83 | 0.74/0.75 | 0.74/0.82 | 0.74/0.87 | 3 |
| BNB + SS + RFE | 0.65/0.67 | 0.71/0.76 | 0.74/0.73 | 0.65/0.77 | 0.68/0.83 | 3 |
: SS, standard scaler; SP, select percentile; RFE, recursive feature eliminator; LR, logistic regression classifier; BNB, Bernoulli Naïve Bayes classifier.
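The grid-search pipelines in the table chain the pre-processing operators in front of the classifier and tune hyperparameters with cross-validated grid search. A minimal sketch for the "LR + SS + RFE" row; the parameter grid and data are illustrative, not the study's settings:

```python
# Sketch of an "LR + SS + RFE" grid-search pipeline (illustrative grid,
# synthetic data -- not the study's configuration).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ("ss", StandardScaler()),                         # SS
    ("rfe", RFE(LogisticRegression(max_iter=1000))),  # RFE
    ("lr", LogisticRegression(max_iter=1000)),        # LR
])
grid = {
    "rfe__n_features_to_select": [5, 10],  # hypothetical grid values
    "lr__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, grid, scoring="balanced_accuracy", cv=3)
search.fit(X, y)
```

Because the scaler and feature eliminator live inside the pipeline, they are refit on each cross-validation fold, avoiding leakage from the held-out portion into pre-processing.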
Fig. 4. Permutation feature importance (PFI) coefficients produced by Model A1 for the P1 dataset with validation set balanced accuracy 0.77 (A) and by Model B1 for the P2 dataset with validation set balanced accuracy 0.78 (B)
Fig. 5. Normalized confusion matrices for the selected TPOT-optimized models for the P1 dataset (A) and the P2 dataset (B)
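The PFI coefficients of Figure 4 come from permutation importance: each feature is shuffled in the validation set and the resulting drop in the model's score is recorded. A sketch using scikit-learn's `permutation_importance`, with synthetic data and a Bernoulli Naïve Bayes classifier standing in for the fitted Model A1/B1 pipelines:

```python
# Sketch of permutation feature importance (PFI), as in Figure 4:
# shuffle one feature at a time and measure the drop in balanced accuracy.
# Synthetic data stands in for the ANGES profiles.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

clf = BernoulliNB().fit(X_tr, y_tr)
result = permutation_importance(clf, X_va, y_va,
                                scoring="balanced_accuracy",
                                n_repeats=10, random_state=0)
# result.importances_mean[i]: mean score drop when feature i is permuted
```

Features whose permutation barely moves the score receive importances near zero; averaging over `n_repeats` shuffles stabilizes the estimate.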