Marta B Lopes, André Veríssimo, Eunice Carrasquinha, Sandra Casimiro, Niko Beerenwinkel, Susana Vinga.
Abstract
BACKGROUND: Learning accurate models from 'omics data is challenging because the data are inherently high-dimensional (e.g. in the number of gene expression variables) while sample sizes are comparatively small, which leads to ill-posed inverse problems. Furthermore, outliers, whether experimental errors or interesting abnormal clinical cases, may severely hamper the correct classification of patients and the identification of reliable biomarkers for a particular disease. We propose to address this problem through an ensemble classification setting based on distinct feature selection and modeling strategies, including logistic regression with elastic net regularization, Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) and Sparse Generalized PLS (SGPLS), coupled with an evaluation of each individual's outlierness based on Cook's distance. The consensus is obtained with the Rank Product statistic corrected for multiple testing, which yields a final list of observations sorted by their level of outlierness.
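
The consensus step described in the abstract can be illustrated with a minimal sketch. It assumes each model has already produced one outlierness score per individual (for example a Cook's distance); the score values and the function name `rank_product` are hypothetical, and the significance test with multiple-testing correction is not reproduced here.

```python
import numpy as np

def rank_product(scores):
    """Combine per-model outlierness scores into a rank product.

    scores: array of shape (n_models, n_observations); a larger score
    means a more outlying observation. Returns (rank_products, ranks),
    where rank 1 is the most outlying observation under a given model.
    """
    scores = np.asarray(scores, dtype=float)
    # Order observations from most to least outlying within each model.
    order = np.argsort(-scores, axis=1, kind="stable")
    # Inverting that ordering gives each observation's 1-based rank
    # (ties are broken by position).
    ranks = np.argsort(order, axis=1, kind="stable") + 1
    return ranks.prod(axis=0), ranks

# Hypothetical outlierness scores: 3 models (rows) x 5 individuals (columns).
scores = [
    [0.90, 0.10, 0.40, 0.20, 0.05],
    [0.70, 0.20, 0.80, 0.10, 0.05],
    [0.60, 0.30, 0.50, 0.10, 0.20],
]
rp, ranks = rank_product(scores)
print(rp.tolist())  # [2, 36, 4, 60, 100]
```

A small rank product means the individual is ranked as highly outlying by all models at once, which is why the first and third individuals stand out in this toy example.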
Keywords: Ensemble modeling; High-dimensionality; Outlier detection; Rank Product test; Triple-negative breast cancer
Year: 2018 PMID: 29728051 PMCID: PMC5936001 DOI: 10.1186/s12859-018-2149-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Correspondence (number of cases) between the HER2 classification of individuals by IHC level and status, and FISH, obtained from the BRCA clinical data (individuals with non-concordance for HER2 classification by different testing (‘positive’ vs. ‘negative’) are highlighted in bold)
| | | HER2 (IHC) status | | | | | |
|---|---|---|---|---|---|---|---|
| | | ‘’ | Equivocal | Indeterminate | Negative | Positive | Total |
| HER2 (IHC) level | ‘’ | 176 | 7 | 6 | 241 | 41 | 471 |
| | 0 (negative) | 0 | 0 | 0 | 60 | 0 | 60 |
| | 1+ (negative) | 0 | 4 | 1 | 255 | **11** | 271 |
| | 2+ (indeterminate) | 0 | 166 | 4 | 1 | 27 | 198 |
| | 3+ (positive) | 1 | 1 | 1 | **2** | 85 | 90 |
| | Total | 177 | 178 | 12 | 559 | 164 | 1090 |
| HER2 (FISH) | ‘’ | 115 | 18 | 4 | 432 | 104 | 673 |
| | equivocal | 0 | 3 | 0 | 0 | 2 | 5 |
| | indeterminate | 0 | 0 | 0 | 3 | 1 | 4 |
| | negative | 53 | 138 | 6 | 121 | **12** | 330 |
| | positive | 9 | 19 | 2 | **3** | 45 | 78 |
| | Total | 177 | 178 | 12 | 559 | 164 | 1090 |
Individuals with discordant HER2 (IHC) status and level, not measured by FISH (individuals not expressing ER and PR, and without a FISH classification are highlighted in bold)
| Individual | ER | PR | HER2 | HER2 level (IHC) | HER2 status (IHC) | Type |
|---|---|---|---|---|---|---|
| TCGA-AC-A8OS | 94.72(+) | 1.14(+) | 23.55 | 1+ | + | non-TNBC |
| TCGA-AN-A0FK | 128.26(+) | 31.59(+) | 25.62 | 1+ | + | non-TNBC |
| TCGA-E9-A295 | 17.36(+) | 6.80(+) | 34.83 | 1+ | + | non-TNBC |
| ∗TCGA-AC-A3YI | 5.65(+) | 0.76(+) | 60.87 | 1+ | + | non-TNBC |
| TCGA-JL-A3YW | 0.35(+) | 0.09(+) | 31.47 | 1+ | + | non-TNBC |
| TCGA-AN-A0FS | 91.87(+) | 1.16(-) | 43.92 | 1+ | + | non-TNBC |
| TCGA-AN-A0FN | 21.34(+) | 1.14(+) | 17.50 | 1+ | + | non-TNBC |
| ∗TCGA-AN-A0FJ | 0.08(+) | 0.04(-) | 14.28 | 1+ | + | non-TNBC |
| TCGA-AC-A3W6 | 28.89(+) | 0.26(+) | 19.24 | 3+ | - | non-TNBC |
| TCGA-AN-A03X | 0.75(+) | 12.28(+) | 38.03 | 1+ | + | non-TNBC |
Individuals marked with an asterisk have discordant HER2 labels across testing methods and are misclassified by a logistic regression on the three variables (ER, PR and HER2) used clinically to classify breast cancer patients as TNBC
Individuals with discordant HER2 (IHC) status and HER2 (FISH) classification (individuals not expressing ER and PR are highlighted in bold)
| Individual | ER | PR | HER2 | HER2 level (IHC) | HER2 (IHC) | HER2 (FISH) | HER2 (IHC + FISH) | Type |
|---|---|---|---|---|---|---|---|---|
| ∗TCGA-LL-A5YP | 0.16(+) | 0.05(-) | 15.10 | 1+ | - | + | + | non-TNBC |
| TCGA-AN-A0XV | 110.33(+) | 3.50(+) | 22.17 | 2+ | + | - | - | non-TNBC |
| TCGA-AO-A12C | 18.88(+) | 3.60(+) | 95.67 | 3+ | + | - | - | non-TNBC |
| TCGA-LL-A5YL | 9.84(+) | 0.52(-) | 55.32 | 3+ | + | - | - | non-TNBC |
| TCGA-E2-A10A | 59.40(+) | 9.89(+) | 32.99 | 2+ | + | - | - | non-TNBC |
| TCGA-AO-A03L | 14.71(+) | 1.35(+) | 58.72 | 2+ | + | - | - | non-TNBC |
| TCGA-AO-A12G | 40.39(+) | 1.80(+) | 45.03 | 2+ | + | - | - | non-TNBC |
| TCGA-BH-A1EX | 5.50(+) | 0.28(+) | 41.46 | | + | - | - | non-TNBC |
| TCGA-BH-A0AU | 37.14(+) | 16.19(+) | 51.79 | | + | - | - | non-TNBC |
| TCGA-AN-A0XW | 13.96(+) | 5.32(+) | 22.04 | 2+ | + | - | - | non-TNBC |
Individuals marked with an asterisk have discordant HER2 labels across testing methods and are misclassified by a logistic regression on the three variables (ER, PR and HER2) used clinically to classify breast cancer patients as TNBC
Summary of FPKM values obtained for ER, PR and HER2 for the individuals under study
| | Class | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| ER | 0 | 0.016 | 16.144 | 36.667 | 47.881 | 69.649 | 272.203 |
| | 1 | 0.019 | 0.160 | 0.351 | 1.530 | 0.828 | 29.979 |
| PR | 0 | 0.008 | 0.600 | 4.228 | 12.012 | 15.326 | 327.913 |
| | 1 | 0.001 | 0.040 | 0.079 | 0.712 | 0.186 | 22.978 |
| HER2 | 0 | 0.605 | 26.580 | 38.732 | 99.741 | 58.801 | 1668.353 |
| | 1 | 1.561 | 13.964 | 19.776 | 21.991 | 26.058 | 103.68 |
Ensemble outlier detection results for the TNBC dataset (mean values for the number of variables selected, MSE and misclassifications for the random strategies are presented)
| | TNBC original data: LOGIT-EN | TNBC original data: SPLS-DA | TNBC original data: SGPLS | Random patients: LOGIT-EN | Random variables |
|---|---|---|---|---|---|
| Variables selected | 107 | 2945 | 551 | 82 | 65 |
| MSE | 0.020 | 0.025 | 0.084 | 0.032 | 0.035 |
| Misclassifications | 16 | 29 | 23 | 35 | 41 |
| Parameter ( | 0.9 | 0.8 | 0.7 | 0.7 | 0.7 |
| Parameter ( | - | 4 | 4 | - | - |

| | TNBC original data | Random patients | Random variables |
|---|---|---|---|
| Resampling number | - | 100 | 100 |
| Influential observations | 24 | 40 | 37 |
| Suspect (influential) observations | 2 | 6 | 6 |
Fig. 1 Individuals’ distributions in the space spanned by the first two SPLS-DA latent vectors. Circles, non-TNBC individuals; triangles, TNBC individuals; blue data points are influential observations; red data points are influential observations which are suspect regarding their HER2 label
Summary of the 24 individuals identified as influential by the RP test applied to the original TNBC data (individuals highlighted in bold are suspect individuals; asterisks refer to individuals identified as influential for both the original TNBC data and after resampling samples or variables)
| Individual | ER | PR | HER2 | HER2 level (IHC) | HER2 (IHC) | HER2 (FISH) | Type | LOGIT-EN | SPLS-DA | SGPLS | RP | p-value | q-value | Miscl. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ∗TCGA-AC-A2QJ | 0.10(-) | 0.00(-) | 7.35 | - | TNBC | 13 | 1 | 34 | 442 | 1.40e-05 | 0.006 | 67 | ||
| ∗TCGA-E9-A22G | 0.44(-) | 0.02(-) | 15.32 | + | non-TNBC | 6 | 13 | 7 | 546 | 1.83e-05 | 0.006 | 33 | ||
| ∗TCGA-AR-A251 | 1.57(+) | 0.10(-) | 14.02 | 2+ | Equiv | - | non-TNBC | 11 | 5 | 9 | 495 | 1.61e-05 | 0.006 | 67 |
| ∗TCGA-AR-A1AJ | 1.47(+) | 0.07(-) | 9.74 | - | non-TNBC | 2 | 10 | 44 | 880 | 3.35e-05 | 0.007 | 100 | ||
| ∗TCGA-A2-A3Y0 | 2.18(+) | 0.03(-) | 11.34 | 1+ | - | non-TNBC | 8 | 7 | 22 | 1232 | 5.01e-05 | 0.009 | 33 | |
| TCGA-A2-A3XV | 0.02(+) | 0.03(-) | 137.94 | 2+ | Equiv | + | non-TNBC | 166 | 2 | 5 | 1660 | 7.09e-05 | 0.011 | 33 |
| ∗TCGA-E9-A1ND | 1.44(-) | 0.05(-) | 13.05 | + | non-TNBC | 5 | 11 | 48 | 2640 | 1.20e-04 | 0.016 | 100 | ||
| TCGA-EW-A1P1 | 3.02(-) | 2.56(-) | 23.64 | 2+ | Equiv | - | TNBC | 16 | 24 | 8 | 3072 | 1.43e-04 | 0.016 | 100 |
| ∗TCGA-E2-A1II | 0.14(-) | 0.19(+) | 10.73 | 1+ | - | non-TNBC | 9 | 21 | 20 | 3780 | 1.80e-04 | 0.017 | 33 | |
| ∗TCGA-C8-A3M7 | 4.27(-) | 0.76(-) | 25.47 | - | TNBC | 1 | 28 | 153 | 4284 | 2.06e-04 | 0.017 | 100 | ||
| ∗TCGA-D8-A1JF | 1.26(-) | 0.11(-) | 32.93 | 1+ | - | TNBC | 19 | 109 | 2 | 4142 | 1.99e-04 | 0.017 | 0 | |
| TCGA-BH-A42U | 9.19(-) | 1.83(-) | 38.37 | - | TNBC | 7 | 3 | 223 | 4683 | 2.28e-04 | 0.018 | 100 | ||
| ∗TCGA-LL-A740 | 0.30(-) | 0.12(-) | 68.56 | 2+ | Equiv | - | TNBC | 47 | 125 | 1 | 5875 | 2.91e-04 | 0.021 | 0 |
| ∗TCGA-A2-A1G6 | 23.90(-) | 21.45(-) | 29.74 | 1+ | - | TNBC | 10 | 4 | 204 | 8160 | 4.14e-04 | 0.025 | 100 | |
| ∗TCGA-OL-A5S0 | 0.09(+) | 0.06(-) | 31.92 | + | non-TNBC | 46 | 8 | 21 | 7728 | 3.91e-04 | 0.026 | 67 | ||
| ∗TCGA-A2-A0YJ | 0.09(+) | 0.03(-) | 240.24 | 0 | - | non-TNBC | 27 | 34 | 12 | 11016 | 5.68e-04 | 0.031 | 0 | |
| ∗TCGA-AR-A1AH | 0.03(+) | 0.03(-) | 34.12 | - | non-TNBC | 21 | 16 | 38 | 12768 | 6.63e-04 | 0.034 | 100 | ||
| ∗TCGA-AC-A62X | 0.19(+) | 0.02(-) | 28.53 | non-TNBC | 22 | 35 | 19 | 14630 | 7.63e-04 | 0.037 | 0 | |||
| TCGA-E2-A1LB | 0.42(-) | 0.09(-) | 1129.87 | 3+ | + | non-TNBC | 230 | 6 | 11 | 15180 | 7.93e-04 | 0.037 | 33 | |
| ∗TCGA-OL-A97C | 16.25(-) | 8.56(-) | 24.04 | - | TNBC | 14 | 41 | 30 | 17220 | 9.02e-04 | 0.040 | 33 | ||
| TCGA-C8-A26X | 0.42(-) | 0.13(-) | 60.12 | 1+ | - | TNBC | 53 | 64 | 6 | 20352 | 1.07e-03 | 0.043 | 33 | |
| TCGA-D8-A1XW | 0.2(-) | 0.11(+) | 21.03 | 1+ | - | non-TNBC | 23 | 14 | 68 | 21896 | 1.15e-03 | 0.044 | 33 |
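
In the rows above, the RP value equals the product of the individual's ranks under LOGIT-EN, SPLS-DA and SGPLS, for example 13 × 1 × 34 = 442 for TCGA-AC-A2QJ. The short check below confirms this on a handful of rows; the values are copied from the table, and the dictionary layout is only an illustrative choice.

```python
from math import prod

# (LOGIT-EN rank, SPLS-DA rank, SGPLS rank) and the reported RP,
# copied from the first rows of the table.
rows = {
    "TCGA-AC-A2QJ": ((13, 1, 34), 442),
    "TCGA-E9-A22G": ((6, 13, 7), 546),
    "TCGA-AR-A251": ((11, 5, 9), 495),
    "TCGA-AR-A1AJ": ((2, 10, 44), 880),
    "TCGA-A2-A3Y0": ((8, 7, 22), 1232),
    "TCGA-A2-A3XV": ((166, 2, 5), 1660),
}

# Collect any individual whose reported RP differs from the product of ranks.
mismatches = [ind for ind, (ranks, rp) in rows.items() if prod(ranks) != rp]
print(mismatches)  # []
```

An individual can therefore reach a low RP through one very extreme rank (TCGA-A2-A3XV is ranked 166 by LOGIT-EN but 2 and 5 by the other two models) or through consistently low ranks across all three.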
Fig. 2 Individuals’ distributions in the space spanned by the first two Principal Components. (a) Symbols correspond to actual labels: circles, non-TNBC individuals; triangles, TNBC individuals; blue data points are influential observations; red data points are influential observations which are suspect regarding their HER2 label. (b) Symbols correspond to labels predicted by the EM algorithm: circles, non-TNBC individuals; triangles, TNBC individuals; red data points are actual non-TNBC observations for which at least one of the 3 TNBC-associated genes has an arguably high expression value
List of up and down-regulated genes in TNBC selected in common by the modeling strategies evaluated on the original TNBC data (variables selected in common with the ‘random patients’ strategy are highlighted in bold)