| Literature DB >> 34713318 |
Elisabetta Manduchi1,2, Joseph D Romano1, Jason H Moore3,4.
Abstract
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning methods. Unfortunately, selecting the right machine learning algorithm and tuning its hyperparameters can be daunting for experts and non-experts alike. The goal of automated machine learning (AutoML) is to let a computer algorithm identify the right algorithms and hyperparameters thus taking the guesswork out of the optimization process. We review the promises and challenges of AutoML for the genetic analysis of complex traits and give an overview of several approaches and some example applications to omics data. It is our hope that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space. The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy.Entities:
Mesh:
Year: 2021 PMID: 34713318 PMCID: PMC9360157 DOI: 10.1007/s00439-021-02393-x
Source DB: PubMed Journal: Hum Genet ISSN: 0340-6717 Impact factor: 5.881
Fig. 1 Flow for k-fold CV on algorithm/hyperparameter selection and evaluation. "A" indicates the selection of an algorithm with specified hyperparameters
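The selection flow in Fig. 1 can be sketched in a few lines of scikit-learn: each candidate algorithm/hyperparameter configuration (the "A" in the figure) is scored by k-fold cross-validation and the best mean score wins. The dataset and the three candidate configurations below are illustrative, not from the paper.

```python
# Minimal sketch of Fig. 1's flow: k-fold CV over candidate
# algorithm/hyperparameter configurations. Candidates are hypothetical.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

candidates = {
    "DT(depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
    "DT(depth=10)": DecisionTreeClassifier(max_depth=10, random_state=0),
    "kNN(k=5)": KNeighborsClassifier(n_neighbors=5),
}

# Score each configuration with 5-fold CV; keep the best mean accuracy.
scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
```

An AutoML system automates exactly this loop, but over a far larger (often combinatorial) space of algorithms and hyperparameters.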
For each of the AutoML systems discussed, the table lists the architecture type of the resulting pipeline, the optimization method, and the type of applications described in this work
| System | Pipeline architecture | Optimization method | Application type |
|---|---|---|---|
| AutoGluon | Layers | Stacked Ensemble | B |
| AutoPrognosis | Ensemble | BO | B |
| Auto-sklearn | Fixed | CASH via BO (SMAC) | B |
| Auto-WEKA | Fixed (and simple) | CASH via BO (SMAC) | B |
| H2O | Ensemble | BO and SuperLearner | |
| Model search | NN | NAS via GP | |
| PennAI | Fixed (and simple) | Recommender | B |
| TPOT | Flexible | CASH via GP | B, O, G |
| TPOT-NN | NN | GP | |
BO = Bayesian optimization; CASH = combined algorithm selection and hyperparameter optimization; GP = genetic programming; NAS = neural architecture search; NN = neural network; B = biomedical (not omics); O = omics (but not genomics); G = genomics
Fig. 2 A hypothetical machine learning pipeline which could be discovered by TPOT. In the top branch of the pipeline, features are selected from a random forest (RF) analysis according to their importance scores and then subjected to a polynomial transformation. The transformed features are then analyzed using a k-nearest neighbors (kNN) algorithm with the output given to a decision tree (DT) as a new engineered feature. In the bottom branch, principal components (PCA) are analyzed by a support vector machine (SVM) with the output given to the DT. The DT performs the final classification using the newly engineered features from the kNN and SVM
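A pipeline of this shape can be approximated with standard scikit-learn building blocks. The sketch below is an approximation, not a TPOT export: the figure routes each feature branch to a different learner, whereas scikit-learn's `StackingClassifier` feeds the combined features to both the kNN and the SVM before the final DT. The dataset is synthetic.

```python
# Approximate sketch of the Fig. 2 pipeline using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Top branch: RF-importance feature selection, then polynomial transform.
top = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    ("poly", PolynomialFeatures(degree=2)),
])
# Bottom branch: principal components.
features = FeatureUnion([("top", top), ("pca", PCA(n_components=3))])

# kNN and SVM outputs become engineered features for the final DT.
model = Pipeline([
    ("features", features),
    ("stack", StackingClassifier(
        estimators=[("knn", KNeighborsClassifier()), ("svm", SVC())],
        final_estimator=DecisionTreeClassifier(random_state=0))),
])
model.fit(X, y)
acc = model.score(X, y)
```

The point of AutoML systems like TPOT is that such multi-branch structures are discovered automatically rather than assembled by hand as above.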
Fig. 3 The essence of genetic programming-based optimization is the selection of good AutoML pipelines (parents) and the introduction of variability to generate new pipelines (children) for evaluation. On the left are two selected parental pipelines. In the first pipeline, features are selected according to their importance scores from a random forest (RF) analysis and then given to a decision tree (DT) which performs the classification. The second pipeline performs a k-nearest neighbors (kNN) and gradient boosting (GB) analysis with the output given to a naïve Bayes (NB) algorithm for classification. Two new pipelines are created by randomly swapping or recombining the RF and kNN algorithms and mutating the NB algorithm to a logistic regression (LR) algorithm. This results in two new pipelines to be evaluated
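The select–recombine–mutate cycle of Fig. 3 can be illustrated with a toy genetic-programming step over pipelines represented as tuples of algorithm names. The fitness function here is a stand-in; in a real system such as TPOT it would be cross-validated accuracy of the assembled pipeline.

```python
# Toy sketch of one genetic-programming generation (Fig. 3).
# Pipelines are tuples of algorithm names; fitness is a placeholder.
import random

random.seed(0)

ALGORITHMS = ["RF", "kNN", "GB", "NB", "LR", "DT"]

def fitness(pipeline):
    # Placeholder: a real AutoML system uses cross-validated accuracy here.
    return len(set(pipeline)) / len(pipeline)

def crossover(p1, p2):
    # Swap the first step between the two parents (e.g. RF <-> kNN in Fig. 3).
    return (p2[0],) + p1[1:], (p1[0],) + p2[1:]

def mutate(pipeline):
    # Replace one random step with a random algorithm (e.g. NB -> LR).
    i = random.randrange(len(pipeline))
    return pipeline[:i] + (random.choice(ALGORITHMS),) + pipeline[i + 1:]

# One generation: select the two fittest parents, recombine, then mutate.
population = [tuple(random.choices(ALGORITHMS, k=3)) for _ in range(10)]
parents = sorted(population, key=fitness, reverse=True)[:2]
child_a, child_b = crossover(*parents)
children = [mutate(child_a), mutate(child_b)]
```

Iterating this loop over many generations, with the children evaluated and fed back into the population, is what drives GP-based AutoML toward better pipelines.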
Fig. 4 An optimal TPOT pipeline derived from the analysis of metabolomics data (Orlenko et al. 2020). In the first step of the pipeline, an Extra Trees analysis is performed with recursive feature elimination to select a subset of most informative features. These selected features are then analyzed using Logistic Regression (LR) and the output included in the data set as a newly engineered feature. This same process is then repeated using a multinomial naïve Bayes (MNB) algorithm. The selected and engineered features are then scaled by subtracting the mean and dividing by the standard deviation. These newly transformed features are then used to classify subjects as cases with coronary artery disease (CAD) or controls with no CAD using a Bernoulli Naïve Bayes (BNB) classifier
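The structure of this exported pipeline can be approximated in plain scikit-learn. The sketch below is not the pipeline from Orlenko et al. (2020): it substitutes the digits dataset for the metabolomics data and collapses the two sequential feature-engineering steps into one stacking layer, but it preserves the Extra Trees + RFE selection, the LR and MNB engineered features, and the scaled BNB final classifier.

```python
# Approximate scikit-learn sketch of the Fig. 4 pipeline structure.
# Dataset and hyperparameters are illustrative stand-ins.
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # subsample to keep the sketch fast

model = Pipeline([
    # Step 1: recursive feature elimination driven by Extra Trees importances.
    ("rfe", RFE(ExtraTreesClassifier(n_estimators=50, random_state=0),
                n_features_to_select=20)),
    # Steps 2-3: LR and MNB outputs appended as engineered features
    # (approximated with one stacking layer), then scaling and BNB.
    ("stack", StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("mnb", MultinomialNB())],
        final_estimator=Pipeline([("scale", StandardScaler()),
                                  ("bnb", BernoulliNB())]),
        passthrough=True)),
])
model.fit(X, y)
acc = model.score(X, y)
```

TPOT discovered the Fig. 4 structure automatically; the value of the export is that it compiles down to ordinary scikit-learn components like these, which can then be inspected and rerun.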