Antje Jensch, Marta B Lopes, Susana Vinga, Nicole Radde.
Abstract
The extraction of novel information from omics data is a challenging task, in particular because the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can strongly affect classification accuracy. Here we introduce ROSIE, an ensemble classification approach that combines three sparse and robust classification methods for outlier detection and feature selection and additionally performs a bootstrap-based validity check. ROSIE determines outliers with the rank product test, using the outlier rankings of all three methods, and takes as important features those selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of 16,600 genes and more than 1,000 samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the genes commonly selected by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is directly applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
Keywords: ensemble; biomarker; classification; feature selection; outlier; robust; sparse; triple-negative breast cancer
Year: 2022 PMID: 35072570 PMCID: PMC9014683 DOI: 10.1177/09622802211072456
Source DB: PubMed Journal: Stat Methods Med Res ISSN: 0962-2802 Impact factor: 2.494
Figure 1. ROSIE workflow and robust and sparse classification methods. A) Three robust and sparse methods perform classification on the dataset. Each method provides an outlier ranking and a set of selected features. Rankings are combined to obtain an outlier list. Important features are taken as the intersection of all three selected feature sets. Validity of the method is assessed by repeatedly classifying bootstrap-sampled datasets and comparing the results with the main part. B) Simplified representation of the underlying classification methods, i.e., sparse robust discriminant analysis with sparse partial robust M regression (SPRM), robust and sparse K-means clustering (RSK-means) and robust and sparse logistic regression with elastic net penalty (enetLTS), for exemplary data comprising two classes and two features.
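The combination step in panel A can be illustrated with a minimal sketch. It assumes each method has already produced an outlier rank per sample (rank 1 = most outlying) and a set of selected feature names. The function names `rank_product` and `common_features` and the toy data are hypothetical and are not taken from the ROSIE implementation.

```python
import numpy as np

def rank_product(rank_matrix):
    """Rank product per sample: the product of its ranks across methods.

    rank_matrix has shape (n_samples, n_methods); rank 1 = most outlying.
    """
    ranks = np.asarray(rank_matrix, dtype=float)
    return ranks.prod(axis=1)

def common_features(*selected):
    """Features selected by every method (plain set intersection)."""
    return set.intersection(*(set(s) for s in selected))

# Hypothetical toy input: 5 samples ranked by three methods (one column each).
ranks = np.array([
    [1, 2, 1],
    [2, 1, 3],
    [3, 4, 2],
    [4, 3, 5],
    [5, 5, 4],
])
rp = rank_product(ranks)
order = np.argsort(rp)  # sample indices, smallest rank product first

# Hypothetical feature sets returned by the three methods.
genes = common_features({"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g2", "g3", "g5"})
```

A small rank product means the sample is ranked as outlying by all methods at once, which is why the product (rather than, say, the minimum rank) is a natural consensus score here.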
Figure 2. ROC curves for simulation study results. Results comparing ROSIE with single methods for three outlier settings. Average AUC values: ROSIE (0.81), ENET (0.79), SPRM (0.76), RSKC (0.65).
Summary of classification results. Number of selected features and number of misclassifications for SPRM, RSK-means and enetLTS.
| | SPRM | RSK-means | enetLTS |
|---|---|---|---|
| # of selected genes | | | |
| Misclassifications | | | |
Summary for influential samples found by the Ensemble procedure. Shown are acquired ranks per method, Rank Product (RP), statistical p- and q-values, misclassification percentage and percentage of significant q-values in bootstrap runs. Suspect cases are marked with an asterisk (*). All influential samples were repeatedly selected as influential in all bootstrap runs they were included in.
| Sample | SPRM | RSK-m | enetLTS | RP | miscl. rate | | |
|---|---|---|---|---|---|---|---|
| TCGA-E9-A22G | | | | | | | |
| TCGA-A2-A0YJ | | | | | | | |
| TCGA-A2-A4S1 | | | | | | | |
| TCGA-A7-A13E | | | | | | | |
| TCGA-A2-A04U * | | | | | | | |
| TCGA-LL-A6FR | | | | | | | |
| TCGA-AR-A0TP | | | | | | | |
| TCGA-AR-A251 | | | | | | | |
| TCGA-AN-A0FJ * | | | | | | | |
| TCGA-OL-A5S0 | | | | | | | |
| TCGA-AN-A0FL * | | | | | | | |
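To make the significance columns concrete, the following sketch shows one way a rank product can be turned into a p-value and a multiple-testing-adjusted value. It is an illustration under stated assumptions, not the authors' procedure: the null distribution is approximated by random permutations, the adjustment uses the Benjamini-Hochberg method, and the function names are hypothetical.

```python
import numpy as np

def rank_product_pvalues(ranks, n_perm=2000, seed=0):
    """Permutation p-values for per-sample rank products.

    ranks: array of shape (n_samples, n_methods), rank 1 = most outlying.
    Null model (assumption): each method ranks the samples in an
    independent, uniformly random order.
    """
    rng = np.random.default_rng(seed)
    ranks = np.asarray(ranks, dtype=float)
    n, k = ranks.shape
    observed = ranks.prod(axis=1)
    count = np.zeros(n)
    for _ in range(n_perm):
        null = np.column_stack(
            [rng.permutation(n) + 1 for _ in range(k)]
        ).prod(axis=1)
        # For each sample, count null rank products at least as small.
        count += (null[:, None] <= observed[None, :]).sum(axis=0)
    # Pseudo-count keeps estimated p-values strictly positive.
    return (count + 1) / (n_perm * n + 1)

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (used here as q-values)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Hypothetical toy input: 20 samples, three methods that agree perfectly.
toy_ranks = np.tile(np.arange(1, 21)[:, None], (1, 3))
p_values = rank_product_pvalues(toy_ranks, n_perm=200)
q_values = benjamini_hochberg(p_values)
```

Samples whose adjusted value falls below a chosen threshold would be flagged as influential; the threshold itself is a design choice that this sketch leaves open.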
Figure 3. Correlation analysis of selected features. Heatmap of correlation values of the commonly selected features.
Figure 4. Relation between influential samples and commonly selected genes. Estimated densities of gene expression of selected features grouped by TNBC (green dashed line) and non-TNBC (red line). Vertical lines represent respective group medians. Blue markers depict influential samples.
Figure 5. Venn diagrams comparing different classification approaches. Comparison of identified outliers (left) and selected genes (right) from ROSIE, the sparse Ensemble approach by Lopes et al. and the robust approach enetLTS by Segaert et al.