Elisabetta Manduchi, Weixuan Fu, Joseph D Romano, Stefano Ruberto, Jason H Moore.
Abstract
BACKGROUND: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) are an appealing approach to this task. However, biomedical data often include baseline characteristics of the study subjects, or batch effects, that must be adjusted for in order to better isolate the effects of the features of interest on the target. The ability to perform covariate adjustments is therefore particularly important when applying AutoML to biomedical big data analysis.
Keywords: AutoML; Covariate adjustment; Feature importance; Genetic programming; Pathways
Year: 2020 PMID: 32998684 PMCID: PMC7528347 DOI: 10.1186/s12859-020-03755-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Boxplots for the results of 100 runs of TPOT on the TG-GATEs data set. Each point corresponds to one run and resides in the boxplot for the pathway (Feature Set) selected in the optimal pipeline for that run. The number above each boxplot indicates the fraction of runs in which that pathway was selected. (a) Results for resAdj TPOT; the y-coordinate indicates the R² on the hold-out Testing dataset. (b) Results for classic TPOT; the y-coordinate indicates the balanced accuracy on the hold-out Testing dataset.
Fig. 2. Permutation importance from 100 runs of TPOT on the TG-GATEs data set. The gene names are displayed on the y-axis, and the averages (weighted by testing score) of the mean score decrease, expressed as a percentage of the score, are displayed on the x-axis. (a) resAdj TPOT; the top 20 features and 7 covariates are shown. The analyses were done at the probeset level, and for Rhoa there were two probesets among the top 20 features. (b) Classic TPOT; the top 20 features are shown. In this case too, some genes had two probesets among the top 20 features.
Fig. 3. Boxplots for the results of 100 runs of TPOT on the PsychENCODE data set. Each point corresponds to one run and resides in the boxplot for the pathway (Feature Set) selected in the optimal pipeline for that run. The number above each boxplot indicates the fraction of runs in which that pathway was selected. (a) Results for resAdj TPOT; the y-coordinate indicates the R² on the hold-out Testing dataset. (b) Results for classic TPOT; the y-coordinate indicates the balanced accuracy on the hold-out Testing dataset.
Fig. 4. The three possible workflows for resAdj TPOT. From top to bottom, the workflows are shown for the cases where adjustments are needed for features only, for both features and target, and for target only. The Feature Set Selector (FSS) step is optional. ‘adjY’ denotes the no-leakage adjustment of the target for each predefined CV split.
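The Fig. 4 caption refers to a no-leakage adjustment of the target for each predefined CV split. The following is a minimal sketch of what such a residual adjustment could look like. It assumes an ordinary least-squares fit of the target on the covariates, which is an assumption of this example rather than something stated above. The function name `residualize_no_leakage` and the simulated data are hypothetical and are not part of TPOT.

```python
import numpy as np

def residualize_no_leakage(values, covariates, train_idx, test_idx):
    """Regress `values` on `covariates` using the training rows only,
    then return residuals for the training rows and the held-out rows.
    (Hypothetical helper; the OLS model is an assumption of this sketch.)"""
    values = np.asarray(values, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Coefficients are estimated from the training fold alone, so the
    # held-out rows never influence the adjustment applied to them.
    coef, *_ = np.linalg.lstsq(design[train_idx], values[train_idx], rcond=None)
    fitted = design @ coef
    return values[train_idx] - fitted[train_idx], values[test_idx] - fitted[test_idx]

# Simulated example: the target depends on two covariates plus a signal.
rng = np.random.default_rng(0)
n = 200
covariates = rng.normal(size=(n, 2))
signal = rng.normal(size=n)
target = 3.0 * covariates[:, 0] - 2.0 * covariates[:, 1] + signal

# Predefined 5-fold split; each held-out fold is adjusted with a model
# fitted on the other four folds.
folds = np.array_split(rng.permutation(n), 5)
adjusted = np.empty(n)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    _, adjusted[test_idx] = residualize_no_leakage(target, covariates, train_idx, test_idx)
```

Fitting the adjustment model inside each split, rather than once on the full data set, is what keeps information from the held-out samples out of the training step. The same helper could be applied column by column when the features, rather than the target, need adjusting.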