Mickael Leclercq1,2, Benjamin Vittrant1,2, Marie Laure Martin-Magniette3,4, Marie Pier Scott Boyer1,2, Olivier Perin5, Alain Bergeron1,6, Yves Fradet1,6, Arnaud Droit1,2.
Abstract
The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software automates all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML.
Keywords: biomarkers signature; feature selection; machine learning; omics; precision medicine
Year: 2019 PMID: 31156708 PMCID: PMC6532608 DOI: 10.3389/fgene.2019.00452
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. BioDiscML pipeline. Preprocessing and feature selection procedures are fully parallelizable. When all feature-optimized models are computed, model selection starts. The program can also be started from the checkpoint at any moment during execution. *The set of ML classifiers is the set of pre-configured commands in the classifiers.conf file. All classifiers are listed in Supplementary Table S1. **Criteria are optimized metrics, evaluated by 10-fold cross-validation (10 CV), used to assess whether a model is improved, such as accuracy, balanced error rate, Matthews correlation coefficient, area under the curve, sensitivity, specificity, root mean squared error, etc. (see Evaluation Criterion). ***Feature selection methods include forward stepwise selection (FSS), backward stepwise selection (BSS), forward stepwise selection and backward stepwise elimination (FSSBSE), backward stepwise selection and forward stepwise elimination (BSSFSE), and “top k” features (see Optimal Feature Subset Search Methods).
Figure 2. BioDiscML accepts as input one or many symbol-separated, table-like structured datasets containing samples in rows and features in columns.
Description of the real-world datasets used to evaluate the performance of BioDiscML vs. recent tools.
| Dataset | Description | Features | Samples | Reference |
| --- | --- | --- | --- | --- |
| Stem cells | Fifteen merged transcriptomics microarray sets from multiple platforms. They contain three types of human cells as classes: human fibroblasts (Fib), embryonic stem cells (ESC), and induced pluripotent stem cells (IPSC) | 13,315 | Train set: 62 ESC, 105 IPSC, 43 Fib | Rohart et al., |
| Colon cancer | Transcriptomics microarray available from the ColonCA R package in Bioconductor (Gentleman et al., | 2,000 | Sixty-two patients, including 40 tumors and 22 normal cases | Alon et al., |
| Central nervous system | Microarray gene expression data derived from brain tumors of the central nervous system, used to predict embryonal tumor outcome | 7,129 | Sixty patients, including 39 medulloblastoma survivors and 21 treatment failure cases | Pomeroy et al., |
| Diffuse large B-cell lymphoma (DLBCL) | Transcriptomic microarray of pre-treatment tumor biopsy specimens, separated into DLBCL and follicular lymphoma | 2,647 | Seventy-seven patients, including 58 DLBCL and 19 follicular lymphoma | Shipp et al., |
| Prostate cancer | Microarray expression analysis used to determine differences in gene expression levels between tumor and non-tumor prostate samples | 2,135 | One hundred two patients, including 52 tumor and 50 normal cases | Singh et al., |
Figure 3. Balanced error rate (BER) comparison of MINT vs. BioDiscML. The train BER was obtained by LOGOCV performance evaluation and the test BER by holdout validation. Values are percentages.
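BER, the balanced error rate listed among the evaluation criteria in the Figure 1 caption, is commonly defined as one minus the mean per-class recall, which makes it less sensitive to class imbalance than plain accuracy. The sketch below illustrates that common definition in Python; the function name and the toy labels are hypothetical and are not taken from BioDiscML or MINT.

```python
from collections import Counter


def balanced_error_rate(y_true, y_pred):
    """Return BER = 1 - mean per-class recall (common definition, assumed here)."""
    totals = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / totals[c] for c in totals]
    return 1.0 - sum(recalls) / len(recalls)


# Hypothetical toy labels: 4 tumor samples and 2 normal samples.
y_true = ["tumor"] * 4 + ["normal"] * 2
y_pred = ["tumor", "tumor", "tumor", "normal", "normal", "tumor"]

ber = balanced_error_rate(y_true, y_pred)
print(f"BER = {100 * ber:.1f}%")
```

In this toy case the recalls are 3/4 for tumor and 1/2 for normal, so the BER is 37.5%, while plain accuracy (4 of 6 correct) would hide the weaker recall on the minority class.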
Figure 4. Boxplot of AUCs obtained by bootstrapping over 100 iterations for the best-performing AucPR methods, AucL (AucPR with Lasso) and AucEN (AucPR with ElasticNet), vs. the best-performing BioDiscML model (Hoeffding Tree).
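Performance of RGIFE vs. BioDiscML measured by accuracy obtained through 10-fold cross-validation (10CV_ACC) and bootstrapping (BS_ACC).
| Dataset | RGIFE accuracy | RGIFE features | RGIFE classifier | BioDiscML 10CV_ACC | BioDiscML BS_ACC | BioDiscML features | BioDiscML classifier | Feature selection | Criterion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNS | 77.1 | Not reported | KNN | 100 | 80.7 | 12 | A2DE | BSSFSE | AUC |
| | | | | 93.3 | 98.6 | 11 | HT | FSSBSE | AUC |
| DLBCL | 68 | 9 | RF | 100 | 93 | 6 | A1DE | FSSBSE | MCC |
| | | | | 98.7 | 98.3 | 6 | NB | FSSBSE | AUC |
| Prostate cancer | 95.2 | 158 | SVM | 100 | 91 | 12 | VFI | BSSFSE | ACC |
| | | | | 99 | 95.7 | 10 | NB | FSSBSE | AUC |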
Classifiers evaluated by RGIFE were K-Nearest Neighbors (KNN), Random Forest (RF), and Support Vector Machines (SVM). The best-performing classifiers identified by BioDiscML were Averaged 2-Dependence Estimators (A2DE), Hoeffding Tree (HT), Averaged 1-Dependence Estimators (A1DE), Voting Feature Intervals (VFI), and Naive Bayes (NB). Hyperparameters are described in .
Dimension reduction by Information Gain and ReliefF
[Algorithm pseudocode lost in extraction. Surviving fragments: “value of”, “with respect to classes”.]
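The ranking idea behind dimension reduction by information gain can be illustrated with a short Python sketch: each feature is scored by how much knowing its value reduces the entropy of the class label, and only the top-scoring features are kept. This is a minimal illustration under stated assumptions, not BioDiscML's implementation: features are assumed to be already discretized, the function names and the toy table are hypothetical, and ReliefF is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels, top_k):
    """Return the names of the top_k features by information gain."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: four samples, two classes, three discretized features.
labels = ["A", "A", "B", "B"]
table = {
    "gene1": ["hi", "hi", "lo", "lo"],  # separates the classes perfectly
    "gene2": ["hi", "lo", "hi", "lo"],  # carries no class information
    "gene3": ["hi", "hi", "hi", "lo"],  # partially informative
}

top = rank_features(table, labels, top_k=2)
print(top)
```

Continuous measurements such as expression levels would need a discretization step before this score applies; how that is done is a design choice that this sketch leaves open.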
Identification of feature subsets and feature-optimized models
[Algorithm pseudocode lost in extraction. Surviving fragments: “(see”, “Add”, “keep”, “feature to the first selected feature in”, “remove”, “discard”, “# create models without stepwise feature subset selection approaches”.]
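Forward stepwise selection (FSS), one of the search methods named in the Figure 1 caption, can be sketched as follows: features are tried in ranked order, and a feature is kept only if adding it improves a cross-validated criterion. The Python sketch below is a hedged illustration, not BioDiscML's code. The nearest-centroid classifier, the accuracy criterion, the interleaved fold assignment, and all function names are assumptions made for the example.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean vector is closest to x (stand-in classifier)."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((v - s / counts[label]) ** 2 for v, s in zip(x, sums[label]))

    return min(sums, key=squared_distance)


def cv_accuracy(X, y, features, folds=5):
    """k-fold cross-validated accuracy using only the given feature indices."""
    n = len(y)
    correct = 0
    for k in range(folds):
        train_idx = [i for i in range(n) if i % folds != k]
        test_idx = [i for i in range(n) if i % folds == k]
        train_X = [[X[i][f] for f in features] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            x = [X[i][f] for f in features]
            correct += nearest_centroid_predict(train_X, train_y, x) == y[i]
    return correct / n


def forward_stepwise_selection(X, y, ranked_features, folds=5):
    """Keep a feature only if it strictly improves the cross-validated criterion."""
    selected, best = [], 0.0
    for f in ranked_features:
        score = cv_accuracy(X, y, selected + [f], folds)
        if score > best:
            selected, best = selected + [f], score
    return selected, best


# Hypothetical toy data: feature 0 is informative, 1 is noise, 2 is constant.
X = [[0, 5, 1], [1, -5, 1], [0, 4, 1], [1, -4, 1], [0, 3, 1],
     [10, -3, 1], [11, 2, 1], [10, -2, 1], [11, 1, 1], [10, -1, 1]]
y = ["A"] * 5 + ["B"] * 5

selected, best = forward_stepwise_selection(X, y, ranked_features=[0, 1, 2])
print(selected, best)
```

Because a feature is kept only on strict improvement, features that merely tie the current score are skipped, which keeps the signature short. A forward pass cannot drop a feature that later turns out to be redundant, which is the gap the combined forward and backward variants (FSSBSE, BSSFSE) are meant to address.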