| Literature DB >> 33228632 |
Animesh Acharjee1,2,3, Joseph Larkman4,5, Yuanwei Xu4,5, Victor Roth Cardoso4,5,6, Georgios V Gkoutos4,5,7,6,8,9.
Abstract
BACKGROUND: Biomarker identification is one of the major and important goal of functional genomics and translational medicine studies. Large scale -omics data are increasingly being accumulated and can provide vital means for the identification of biomarkers for the early diagnosis of complex disease and/or for advanced patient/diseases stratification. These tasks are clearly interlinked, and it is essential that an unbiased and stable methodology is applied in order to address them. Although, recently, many, primarily machine learning based, biomarker identification approaches have been developed, the exploration of potential associations between biomarker identification and the design of future experiments remains a challenge.Entities:
Keywords: Biomarker; Feature selection; Power study; Random forest
Year: 2020 PMID: 33228632 PMCID: PMC7685541 DOI: 10.1186/s12920-020-00826-6
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Schematic diagram of the simulation set-up and the published, experimentally derived (real) data analysis
A list of the published datasets used in this study
| RF mode | Dataset type | Sample number (N) | Feature number (p) | Outcome variable | Pubmed ID | References |
|---|---|---|---|---|---|---|
For each of the RF models, two datasets were considered and the model outcomes were compared with the published results
List of the methods and R packages used
| Method | R package used | References |
|---|---|---|
| Random forest | randomForest | |
| Random forest (optimised for memory) | ranger | |
| Boruta | Boruta | |
| Hyperparameter selection | caret | |
| Permutation-based feature selection | pomona | |
| Recursive feature elimination (RFE) | varSelRF | |
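The shadow-feature idea behind the Boruta package listed above can be sketched without the original R code. The following Python snippet (scikit-learn in place of the R stack; the toy data, feature counts, and single-pass test are all illustrative assumptions, not the authors' settings) keeps a real feature only when its importance beats the strongest shuffled "shadow" copy:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy data: only the first 5 of 30 features carry signal.
X = rng.uniform(size=(300, 30))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=300)

# Boruta-style single pass: append independently shuffled "shadow" copies
# of every feature, then keep real features whose impurity importance
# exceeds that of the strongest shadow feature.
shadow = rng.permuted(X, axis=0)          # each column shuffled independently
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(np.hstack([X, shadow]), y)
imp = rf.feature_importances_
threshold = imp[30:].max()                # best importance any shadow achieves
selected = np.where(imp[:30] > threshold)[0]
print(selected.tolist())
```

The full Boruta algorithm repeats this comparison over many iterations with a statistical test; this one-shot version only conveys the shadow-feature principle.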
Fig. 2 Results from the simulation study in RF regression mode. a The structure of the simulated predictor data, drawn from a uniform distribution, and its association with the outcome variable (y) are described. Only V1–V120 of the full 5000-variable dataset are shown. b The number of features stably selected by each approach in at least 5/100 iterations (Low Stringency) or a minimum of 90/100 iterations (High Stringency) is shown. True positives: V1–V30; false positives: V31–V5000. Values describing the number of times each feature is chosen by a particular approach are averaged across the counts achieved after 100 iterations for each of the four inner-loop test datasets. c The variance in predictive accuracy (R-squared), across all four outer-loop cross-validation repeats, is shown for RFs trained using only the High or Low Stringency stable features selected by each feature selection approach using the relevant inner-loop test dataset
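The stability criterion described in panel b — counting how often each feature is selected across repeated model fits and applying low/high-stringency frequency thresholds — can be sketched as follows. This is a scaled-down Python illustration: the 20 bootstrap iterations, top-10 selection rule, and toy data are assumptions for brevity, not the authors' 100-iteration pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p, informative = 150, 40, 5
X = rng.uniform(size=(n, p))
y = X[:, :informative].sum(axis=1) + rng.normal(scale=0.1, size=n)

n_iter = 20
counts = np.zeros(p)
for i in range(n_iter):
    idx = rng.choice(n, size=n, replace=True)        # bootstrap resample
    rf = RandomForestRegressor(n_estimators=100, random_state=i)
    rf.fit(X[idx], y[idx])
    # "Select" the 10 features with the highest impurity importance.
    counts[np.argsort(rf.feature_importances_)[::-1][:10]] += 1

low = np.where(counts >= 0.05 * n_iter)[0]    # low stringency: >= 5% of runs
high = np.where(counts >= 0.90 * n_iter)[0]   # high stringency: >= 90% of runs
print(len(high), len(low))
```

Truly informative features are selected in nearly every iteration and survive the high-stringency cut, while features picked up only by chance appear sporadically and, at most, pass the low-stringency cut — the trade-off Fig. 2b quantifies.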
List of the methods, stringency (high and low) and evaluation criteria used for both regression and classification
| RF methods | Stringency | Criteria | Regression: Simulation | Regression: Metabolomics | Regression: Lipidomics | Classification: Simulation | Classification: Lipidomics | Classification: Transcriptomics-1 |
|---|---|---|---|---|---|---|---|---|
| RFE | High | TP/Known | 20 | 1 | 1 | 10 | 0 | – |
| | | FP/Novel | 0 | 5 | 0 | 0 | 0 | 0 |
| | Low | TP/Known | 29 | 3 | 3 | 29 | 2 | – |
| | | FP/Novel | 19 | 91 | 8 | 201 | 18 | 14 |
| Boruta | High | TP/Known | 20 | 3 | 2 | 11 | 2 | – |
| | | FP/Novel | 0 | 43 | 6 | 0 | 24 | 19 |
| | Low | TP/Known | 29 | 3 | 3 | 29 | 3 | – |
| | | FP/Novel | 1 | 83 | 34 | 9 | 10 | 39 |
| Permutation (Raw) | High | TP/Known | 29 | 2 | 2 | 19 | 2 | – |
| | | FP/Novel | 0 | 24 | 7 | 0 | 10 | 0 |
| | Low | TP/Known | 29 | 3 | 3 | 29 | 3 | – |
| | | FP/Novel | 98 | 68 | 47 | 465 | 35 | 132 |
| Permutation (Corrected) | High | TP/Known | 21 | 2 | 1 | 11 | 2 | – |
| | | FP/Novel | 0 | 1 | 0 | 0 | 6 | 0 |
| | Low | TP/Known | 29 | 3 | 3 | 29 | 3 | – |
| | | FP/Novel | 8 | 48 | 26 | 110 | 46 | 66 |
Evaluation criteria compare the number of features identified by each method against the known features already reported or simulated in the model
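As a minimal illustration of these criteria (the feature names below are hypothetical), TP/Known counts selected features that fall inside the known set, while FP/Novel counts selections outside it:

```python
# Evaluation as in the table: compare selected features against known markers.
known = {"V1", "V2", "V3"}
selected = {"V1", "V3", "V17", "V204"}
tp_known = len(selected & known)   # selected features recovered from the known set
fp_novel = len(selected - known)   # selected features outside the known set ("novel")
print(tp_known, fp_novel)          # → 2 2
```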
Fig. 3 Validation model performance and power analysis of published, experimentally derived data 1, regression mode. a Boxplots displaying the variance in the observed R-squared value of validation models trained using the stable features selected by each feature selection approach, across four outer-loop CV repeats. Values are shown for models trained using either the features selected by each approach in at least 5/100 iterations (Low Stringency) or a minimum of 90/100 iterations (High Stringency). b The three groups of correlated features identified by the power function are represented by the group member with the largest observed effect size. The effect size of each assessed variable is shown along the y-axis and a series of sample sizes along the x-axis. Power values are determined for each effect/sample size combination using a simulated dataset with the same correlation structure as the input data and displayed using variably sized/coloured rhombi
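Simulation-based power estimation of the kind panel b describes can be sketched as follows. This minimal Python example estimates power for a two-sample t-test at a given effect size and per-group sample size — an illustrative simplification, not the correlated-feature procedure the workflow applies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(effect_size, n, n_sims=2000, alpha=0.05):
    """Fraction of simulated two-sample t-tests reaching significance."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, size=n)           # control group
        b = rng.normal(effect_size, 1.0, size=n)   # shifted group
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A large effect (Cohen's d = 0.8) with 26 samples per group gives ~80% power.
p = simulated_power(0.8, 26)
print(round(p, 2))
```

Repeating such a simulation over a grid of effect and sample sizes produces exactly the kind of power surface that Fig. 3b visualises with variably sized/coloured rhombi.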
Fig. 4 Results from the public datasets identified by module 1 of the workflow, listed for probability values < 0.05. a Stable metabolic markers and the variance they explain with respect to relative liver weight are shown. b Lipids associated with the amount of milk in 3-month-old infants are listed
Fig. 5 Screenshots of the open-source web application ‘PowerTools’, for efficient and accessible simulation-based power calculations