Sunwoo Han, Brian D Williamson, Youyi Fong.
Abstract
BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.
Keywords: Case–control design; Class imbalance; HIV vaccine; Variable screening
Year: 2021 PMID: 34809631 PMCID: PMC8607560 DOI: 10.1186/s12911-021-01688-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 3.298
Fig. 1 A flow diagram for the random forest algorithm in the context of classification
Comparison of CV-AUCs obtained by standard random forest (RF), random forest with under-sampling (RF_under), random forest with over-sampling (RF_over), and random forest with inverse sampling probability weights (RF_ipw), including results obtained without variable screening and results obtained with variable screening
| Marker set | RF (no screening) | RF_under (no screening) | RF_over (no screening) | RF_ipw (no screening) | RF (screening) | RF_under (screening) | RF_over (screening) | RF_ipw (screening) |
|---|---|---|---|---|---|---|---|---|
| All markers | 0.679 | 0.732 | 0.711 | 0.657 | 0.824 | 0.806 | 0.806 | 0.824 |
| T cell markers | 0.718 | 0.714 | 0.715 | 0.708 | 0.812 | 0.780 | 0.799 | 0.819 |
| Antibody markers | 0.605 | 0.656 | 0.628 | 0.579 | 0.708 | 0.722 | 0.696 | 0.711 |
| No markers | 0.442 | 0.452 | 0.448 | 0.443 | 0.442 | 0.452 | 0.448 | 0.443 |
Clinical covariates (age, BMI, and a risk behavior score) are always included
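The three strategies compared in the table above (under-sampling, over-sampling, and inverse sampling probability weights) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy labels, and the sampling probabilities are all assumptions made for the example, and the resulting indices or weights would still have to be passed to whichever random forest implementation is used.

```python
import random


def _split_by_class(labels):
    """Return (minority, majority) lists of row indices for 0/1 labels."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((cases, controls), key=len)
    return minority, majority


def undersample_indices(labels, rng):
    """Keep every minority row and an equal-size random subset of the majority."""
    minority, majority = _split_by_class(labels)
    return sorted(minority + rng.sample(majority, len(minority)))


def oversample_indices(labels, rng):
    """Keep every row and redraw minority rows with replacement until balanced."""
    minority, majority = _split_by_class(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return sorted(minority + majority + extra)


def inverse_probability_weights(strata, sampling_prob):
    """Weight each phase-two row by 1 / P(sampled into phase two | stratum)."""
    return [1.0 / sampling_prob[s] for s in strata]


# Toy data (hypothetical): 3 cases and 9 controls.
rng = random.Random(0)
labels = [1] * 3 + [0] * 9
under = undersample_indices(labels, rng)   # 3 cases + 3 controls
over = oversample_indices(labels, rng)     # 9 case draws + 9 controls

# Hypothetical design: all cases sampled, one in four controls sampled.
weights = inverse_probability_weights(
    ["case", "control"], {"case": 1.0, "control": 0.25}
)
print(len(under), len(over), weights)
```

Resampling changes the class mix seen by each tree, whereas the weights leave the data unchanged and instead tell the learner how many phase-one subjects each phase-two row stands for.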
Comparison of CV-AUC of standard random forest (RF) and tuned random forest (tRF). Screening is applied to both methods, but not class balancing or inverse sampling probability weighting
| Marker set | RF | tRF |
|---|---|---|
| All markers | 0.824 | 0.807 |
| T cell markers | 0.812 | 0.802 |
| Antibody markers | 0.708 | 0.721 |
| No markers | 0.442 | 0.455 |
Clinical covariates (age, BMI, and a risk behavior score) are always included
Comparison of CV-AUCs of generalized linear models (GLM) and random forest (RF), with and without clinical covariates
| Marker set | GLM, no covariates | GLM, covariates | RF, no covariates | RF, covariates |
|---|---|---|---|---|
| All markers | 0.810 | 0.813 | 0.808 | 0.824 |
| T cell markers | 0.781 | 0.793 | 0.806 | 0.819 |
| Antibody markers | 0.759 | 0.768 | 0.729 | 0.711 |
| No markers | 0.500<sup>a</sup> | 0.624 | 0.500<sup>a</sup> | 0.443 |
Screening is applied for both GLM and RF, but class balancing is not done for RF. Inverse sampling probability weights are used in both GLM and RF training
<sup>a</sup> Denotes theoretical values
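The theoretical value of 0.500 for a model with no predictors follows from the rank definition of AUC: a model that gives every subject the same score ties every case-control pair. The sketch below shows that definition and a simple cross-validated average. It assumes CV-AUC is the unweighted mean of per-fold test-set AUCs, which this excerpt does not spell out; the function names and toy scores are hypothetical.

```python
def auc(scores, labels):
    """Fraction of case-control pairs where the case scores higher (ties count 1/2)."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def cv_auc(fold_scores, fold_labels):
    """Unweighted mean of the per-fold test AUCs (an assumption, see lead-in)."""
    aucs = [auc(s, y) for s, y in zip(fold_scores, fold_labels)]
    return sum(aucs) / len(aucs)


# Three of the four case-control pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75

# A constant score (no predictors) ties every pair.
print(auc([0.2, 0.2, 0.2], [1, 0, 0]))  # 0.5
```

An estimated CV-AUC below 0.5, as in some of the "No markers" rows, means the fitted scores ranked held-out cases below controls more often than not.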
Comparison of CV-AUCs of four stacking models and two random forest models
| Model | CV-AUC |
|---|---|
| RF: T cell markers + GLM: antibody markers | 0.838 |
| RF: T cell markers + GLM: All markers | 0.831 |
| RF: T cell markers | 0.819 |
| RF: All markers + GLM: antibody markers | 0.821 |
| RF: All markers + GLM: All markers | 0.821 |
| RF: All markers | 0.824 |
Screening is applied for both GLM and RF, but class balancing is not done for RF. Inverse sampling probability weights are used in both GLM and RF training. Clinical covariates are included in the predictors of all RF and GLM models
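The excerpt does not say how the RF and GLM predictions are combined in the stacking models. As one common possibility, the sketch below fits a logistic meta-model on two base scores per subject. Everything here is an assumption for illustration: the function names, the plain gradient-descent fit, the learning rate, and the toy scores. In practice the base scores fed to the meta-model would be out-of-fold predictions so that the stacker does not see scores fitted on the same subjects.

```python
import math


def stacked_score(weights, base_scores_row):
    """Logistic combination of base scores: sigmoid(w0 + w1*s1 + w2*s2 + ...)."""
    z = weights[0] + sum(w * s for w, s in zip(weights[1:], base_scores_row))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_stacker(base_scores, labels, lr=0.5, n_iter=500):
    """Fit the meta-model weights by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(base_scores[0]) + 1)
    n = len(labels)
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for row, y in zip(base_scores, labels):
            err = stacked_score(weights, row) - y
            grad[0] += err                      # intercept
            for j, s in enumerate(row, start=1):
                grad[j] += err * s              # one weight per base learner
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Toy data (hypothetical): (RF score, GLM score) per subject, 3 cases, 5 controls.
base_scores = [
    (0.8, 0.6), (0.4, 0.9), (0.7, 0.3),
    (0.3, 0.2), (0.5, 0.4), (0.2, 0.5), (0.1, 0.1), (0.7, 0.6),
]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

weights = fit_logistic_stacker(base_scores, labels)
combined = [stacked_score(weights, row) for row in base_scores]
print(weights)
```

The combined scores can then be evaluated with the same CV-AUC criterion as the single models.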
Fig. 2 Top: boxplots of three prediction scores by cases and controls from one 5-fold cross validation. Bottom: scatterplots of these prediction scores. Cases are shown in red, and controls are shown in black. Study volunteers 180 and 183 are plotted as triangles