Hongwei Sun1,2, Jiu Wang1, Zhongwen Zhang1, Naibao Hu1, Tong Wang2.
Abstract
High dimensionality and noise make it difficult to detect relevant biomarkers in omics data. Previous studies have shown that penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, the algorithm commonly used in these studies is the concentration step (C-step), and when the C-step is applied to robust penalized regression it does not guarantee that the criterion function improves at every iteration, because the regularization parameters change during the iterations. As a result, the C-step algorithm runs very slowly, especially on high-dimensional omics data. To address this, the AR-Cstep algorithm (the C-step combined with an acceptance-rejection scheme) is proposed. In simulation experiments, the AR-Cstep algorithm converged faster (its average computation time was only 2% of that of the C-step algorithm) and was more accurate than the C-step algorithm in terms of variable selection and outlier identification. The two algorithms were further compared on triple-negative breast cancer (TNBC) RNA-seq data. AR-Cstep resolves the problem of the C-step failing to converge and ensures that the iterations proceed in a direction that improves the criterion function. As an improvement on the C-step algorithm, AR-Cstep can be extended to other robust models with regularization parameters.
Year: 2021 PMID: 34976114 PMCID: PMC8716222 DOI: 10.1155/2021/9436582
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Algorithm 1: Description of the C-step algorithm.
Algorithm 2: Description of the AR-Cstep algorithm.
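The algorithm listings themselves are not reproduced in this record. As a rough illustration only, the idea of a concentration step guarded by an accept/reject check can be sketched as follows. Everything here is an assumption made for the sketch: the names `ar_cstep`, `fit`, `losses`, and `penalty` are hypothetical, the toy problem is a ridge-shrunken trimmed mean rather than the penalized trimmed likelihood used in the paper, the regularization parameter is held fixed, and stopping on the first rejected candidate is only one possible way to handle a rejection.

```python
import random


def ar_cstep(fit, losses, penalty, n, h, max_iter=100, seed=0):
    """Concentration steps with an accept/reject check (illustrative sketch).

    fit(subset)   -> model fitted on the sample indices in `subset`
    losses(model) -> list of n per-sample losses under `model`
    penalty(model)-> regularization term of the criterion
    """
    rng = random.Random(seed)
    subset = sorted(rng.sample(range(n), h))  # random initial h-subset
    model = fit(subset)
    per_sample = losses(model)
    objective = sum(per_sample[i] for i in subset) + penalty(model)
    trace = [objective]

    for _ in range(max_iter):
        # C-step: candidate subset = the h samples with the smallest loss
        ranked = sorted(range(n), key=lambda i: per_sample[i])
        candidate = sorted(ranked[:h])
        if candidate == subset:
            break  # subset is stable: converged

        cand_model = fit(candidate)
        cand_losses = losses(cand_model)
        cand_obj = sum(cand_losses[i] for i in candidate) + penalty(cand_model)

        # Acceptance-rejection: keep the candidate only if the criterion improves
        if cand_obj >= objective:
            break  # reject the candidate and stop (assumed handling)
        subset, model = candidate, cand_model
        per_sample, objective = cand_losses, cand_obj
        trace.append(objective)

    return subset, model, trace


# Toy data: eight clean values and two gross outliers (indices 8 and 9).
data = [0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.2, -0.3, 8.0, 9.0]
lam = 1.0  # fixed ridge parameter (assumption; the paper's parameters vary)


def fit(subset):
    # Minimizer of sum((x_i - mu)^2 over the subset) + lam * mu^2
    return sum(data[i] for i in subset) / (len(subset) + lam)


def losses(mu):
    return [(x - mu) ** 2 for x in data]


def penalty(mu):
    return lam * mu ** 2


subset, mu, trace = ar_cstep(fit, losses, penalty, n=len(data), h=8)
print(subset, round(mu, 4), [round(v, 3) for v in trace])
```

Because accepted candidates must strictly lower the criterion, the values in `trace` can only decrease, which mirrors the property the abstract attributes to AR-Cstep. With a fixed penalty, as in this toy, the check rarely triggers; it matters when the regularization parameters are re-tuned between iterations.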
Figure 1: Results of MTL-EN, enetLTS, and Ensemble when n = 500 and p = 1000. Sn: sensitivity; FPR: False Positive Rate; PSR: Positive Selection Rate; FDR: False Discovery Rate.
Results of Ensemble for the datasets with n = 500, p = 1,000, and ε = 0.15.
| Data | Model size | PSR | FDR | GM# |
|---|---|---|---|---|
| Original data | 16.06 | 0.533 | 0.003 | 0.714 |
| Subset∗ | 19.79 | 0.644 | 0.022 | 0.786 |
| Subset∗∗ | 21.75 | 0.708 | 0.021 | 0.828 |
∗This subset is the original dataset after removing outliers identified by enetLTS. ∗∗This subset is the original dataset after removing outliers identified by MTL-EN. #GM: the geometric mean of PSR and (1-FDR).
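The GM column can be read as sqrt(PSR × (1 − FDR)). The short sketch below applies that formula to the averaged PSR and FDR values in the table; the function name `gm` is hypothetical. Note that the results (about 0.729, 0.794, and 0.833) are slightly higher than the tabulated GM values, which would be expected if GM was computed per simulated dataset and then averaged. That explanation is an assumption, not something stated in this record.

```python
import math


def gm(psr, fdr):
    """Geometric mean of PSR and (1 - FDR)."""
    return math.sqrt(psr * (1.0 - fdr))


# (PSR, FDR) pairs copied from the table above
rows = {
    "Original data": (0.533, 0.003),
    "Subset*": (0.644, 0.022),
    "Subset**": (0.708, 0.021),
}
gm_values = {name: gm(psr, fdr) for name, (psr, fdr) in rows.items()}
for name, value in gm_values.items():
    print(f"{name}: {value:.3f}")
```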
Computation times of MTL-EN and enetLTS for the datasets with n = 500, p = 1,000, and ε = 0.1.
| Methods | Mean(s) |
|---|---|
| enetLTS | 6489.06 |
| MTL-EN | 165.2 |
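As a quick check on the speed-up, the ratio of the two mean times can be computed directly from the table. This assumes the enetLTS row reflects the C-step-based fit and the MTL-EN row the AR-Cstep-based fit, which this record does not state explicitly.

```python
# Mean computation times in seconds, copied from the table above
times_s = {"enetLTS": 6489.06, "MTL-EN": 165.2}

ratio = times_s["MTL-EN"] / times_s["enetLTS"]    # fraction of enetLTS time
speedup = times_s["enetLTS"] / times_s["MTL-EN"]  # fold speed-up

print(f"MTL-EN takes {ratio:.1%} of the enetLTS time ({speedup:.1f}x faster)")
```

For this setting the ratio is about 2.5%, roughly a 39-fold speed-up, in line with the "only 2%" average quoted in the abstract.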
Number of misclassified observations detected using enetLTS and MTL-EN.
| Method | Identified misclassification | Num of TNBC/non-TNBC∗ | Num of suspect TNBC/non-TNBC∗∗ |
|---|---|---|---|
| enetLTS | 68 | 3/65 | 0/7 |
| MTL-EN | 47 | 12/35 | 1/6 |
∗Number of identified misclassified observations with TNBC/non-TNBC labels. ∗∗Number of identified suspect individuals with inconsistent labels.
Forty-seven misclassified observations detected using MTL-EN for the TNBC dataset#.
| ID | ESR | PGR | HER2 | HER2_level | HER2_status | HER2_FISH | Label | Perres |
|---|---|---|---|---|---|---|---|---|
| TCGA-E9-A22G | 0.44 (-) | 0.02 (-) | 15.32 | + | Non-TNBC | 32.54 | ||
| TCGA-A2-A3Y0 | 2.18 (+) | 0.03 (-) | 11.34 | 1+ | - | Non-TNBC | 29.71 | |
| TCGA-A2-A04U | 0.02 (-) | 0.02 (-) | 9.64 | 1+ | - | + | Non-TNBC | 22.86 |
| TCGA-BH-A1EW | 29.98 (-) | 18.90 (-) | 42.47 | - | TNBC | 18.89 | ||
| TCGA-GM-A2DI | 23.49 (-) | 12.05 (-) | 20.30 | - | TNBC | 14.73 | ||
| TCGA-S3-AA0Z | 16.67 (+) | 0.07 (+) | 33.07 | 1+ | Equiv | - | Non-TNBC | 14.58 |
| TCGA-AN-A0FJ | 0.08 (+) | 0.04 (-) | 14.28 | 1+ | + | Non-TNBC | 14.03 | |
| TCGA-BH-A5IZ | 5.12 (+) | 0.03 (-) | 28.08 | - | - | Non-TNBC | 13.87 | |
| TCGA-OL-A5S0 | 0.09 (+) | 0.06 (-) | 31.92 | + | Non-TNBC | 13.45 | ||
| TCGA-E9-A1ND | 1.44 (-) | 0.05 (-) | 13.05 | + | Non-TNBC | 13.05 | ||
| TCGA-B6-A0IJ | 1.18 (+) | 0.46 (+) | 11.12 | Non-TNBC | 11.92 | |||
| TCGA-AR-A251 | 1.57 (+) | 0.10 (-) | 14.02 | 2+ | Equiv | - | Non-TNBC | 10.70 |
| TCGA-D8-A1JM | 5.00 (+) | 0.01 (-) | 21.85 | 1+ | - | Non-TNBC | 10.52 | |
| TCGA-E2-A1II | 0.14 (-) | 0.19 (+) | 10.73 | 1+ | - | Non-TNBC | 10.51 | |
| TCGA-A2-A1G6∗ | 23.90 (-) | 21.45 (-) | 29.74 | 1+ | - | TNBC | 9.62 | |
| TCGA-A2-A0YJ | 0.09 (+) | 0.03 (-) | 240.24 | 0 | - | Non-TNBC | 9.52 | |
| TCGA-LL-A5YP | 0.16 (+) | 0.05 (-) | 15.10 | 1+ | - | + | Non-TNBC | 9.23 |
| TCGA-E9-A1NC | 0.11 (-) | 0.07 (+) | 15.91 | + | Non-TNBC | 8.98 | ||
| TCGA-AC-A62X | 0.19 (+) | 0.02 (-) | 28.53 | Non-TNBC | 8.93 | |||
| TCGA-A7-A13E | 0.82 (+) | 0.06 (-) | 46.08 | 2+ | Equiv | - | Non-TNBC | 8.77 |
| TCGA-C8-A3M7 | 4.27 (-) | 0.76 (-) | 25.47 | - | TNBC | 8.71 | ||
| TCGA-AR-A1AJ | 1.47 (+) | 0.07 (-) | 9.74 | - | Non-TNBC | 8.53 | ||
| TCGA-BH-A0DL | 6.99 (+) | 0.04 (-) | 9.92 | - | Non-TNBC | 7.85 | ||
| TCGA-E2-A1L7∗ | 29.61 (-) | 22.98 (-) | 10.33 | - | TNBC | 7.35 | ||
| TCGA-AR-A1AH | 0.03 (+) | 0.03 (-) | 34.12 | - | Non-TNBC | 7.31 | ||
| TCGA-E2-A14Y | 0.67 (+) | 0.03 (+) | 487.90 | 2+ | Equiv | + | Non-TNBC | 7.11 |
| TCGA-LL-A8F5 | 1.08 (+) | 0.04 (-) | 11.86 | 1+ | - | Non-TNBC | 6.96 | |
| TCGA-OL-A97C∗ | 16.25 (-) | 8.56 (-) | 24.04 | - | TNBC | 6.86 | ||
| TCGA-A7-A13D | 0.52 (-) | 0.81 (+) | 42.28 | 2+ | Equiv | - | Non-TNBC | 6.73 |
| TCGA-AR-A0TP | 0.04 (+) | 0.03 (-) | 13.39 | - | Non-TNBC | 6.53 | ||
| TCGA-LL-A6FR | 0.33 (-) | 0.04 (+) | 32.13 | 2+ | Equiv | + | Non-TNBC | 6.19 |
| TCGA-A2-A25F | 0.62 (-) | 0.23 (+) | 5.19 | - | Non-TNBC | 5.86 | ||
| TCGA-AO-A0JL | 0.63 (-) | 0.08 (-) | 63.60 | 1+ | - | + | Non-TNBC | 5.45 |
| TCGA-A2-A1G1 | 0.53 (-) | 0.17 (-) | 819.76 | 2+ | Equiv | + | Non-TNBC | 5.28 |
| TCGA-BH-A42U∗ | 9.19 (-) | 1.83 (-) | 38.37 | - | TNBC | 4.99 | ||
| TCGA-AN-A0FX | 1.13 (-) | 0.64 (-) | 24.02 | 1+ | + | Non-TNBC | 4.75 | |
| TCGA-D8-A1XW | 0.32 (-) | 0.11 (+) | 21.03 | 1+ | - | Non-TNBC | 4.57 | |
| TCGA-AR-A24Q | 1.00 (+) | 0.36 (-) | 20.67 | - | Non-TNBC | 4.52 | ||
| TCGA-A1-A0SB | 3.16 (+) | 0.03 (-) | 32.35 | - | Non-TNBC | 4.47 | ||
| TCGA-A2-A4RX | 0.68 (+) | 0.93 (+) | 26.64 | 1+ | - | Non-TNBC | 3.18 | |
| TCGA-AN-A0FL | 0.09 (-) | 1.07 (-) | 15.07 | 1+ | + | Non-TNBC | 3.01 | |
| TCGA-A2-A0EQ∗ | 2.13 (-) | 0.04 (-) | 30.15 | 3+ | + | - | TNBC | 2.63 |
| TCGA-EW-A1OV∗ | 0.23 (-) | 0.03 (-) | 28.91 | - | - | TNBC | 1.83 | |
| TCGA-OL-A5D6∗ | 0.35 (-) | 0.20 (-) | 72.13 | - | TNBC | 1.69 | ||
| TCGA-C8-A26X∗ | 0.42 (-) | 0.13 (-) | 60.12 | 1+ | - | TNBC | 1.62 | |
| TCGA-LL-A740∗ | 0.30 (-) | 0.12 (-) | 68.56 | 2+ | Equiv | - | TNBC | 1.48 |
| TCGA-BH-A6R9 | 0.59 (-) | 0.25 (+) | 8.18 | - | Non-TNBC | 0.99 |
#Including the expression values and the IHC and FISH test results for ER, PR, and HER2 (individuals highlighted in bold are suspect individuals). ∗Outliers detected by MTL-EN but not by enetLTS. ∗∗Perres: the absolute value of the Pearson residual.
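To give a sense of scale for the Perres column, the sketch below uses the standard Pearson residual for a binary outcome, (y − p) / sqrt(p(1 − p)), where p is the fitted probability of y = 1. Whether the paper uses exactly this form is an assumption, and both function names are hypothetical. Under this definition, the absolute residual depends only on the fitted probability q of the observed label, with q = 1 / (1 + Perres²).

```python
import math


def abs_pearson_residual(y, p):
    """|y - p| / sqrt(p * (1 - p)) for a binary label y and fitted P(y = 1) = p."""
    return abs(y - p) / math.sqrt(p * (1.0 - p))


def prob_of_observed_label(perres):
    """Fitted probability of the observed label implied by a given |residual|."""
    return 1.0 / (1.0 + perres ** 2)


# Largest and smallest Perres values in the table above
q_largest = prob_of_observed_label(32.54)
q_smallest = prob_of_observed_label(0.99)
print(f"{q_largest:.5f} {q_smallest:.3f}")
```

Read this way, the top-ranked observation would have a fitted probability below 0.1% for its recorded label, while the last one would sit near 50%.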
Top 40 genes selected by MTL-EN for the TNBC dataset.
| Upregulated | COX7B2 (0.14), LBP (0.12), SLC15A1 (0.11), B3GNT5 (0.10), A2ML1 (0.10), FOXC1 (0.09), COL9A3 (0.09), KRT16 (0.09), FDCSP (0.09), FABP7 (0.09), AADAT (0.09), VSNL1 (0.09), KLK6 (0.09), PPP1R14C (0.08), GZMB (0.07), CCNE1 (0.07), FAM171A1 (0.07) |
| Downregulated | AGR3 (-0.24), CA12 (-0.20), AGR2 (-0.19), MLPH (-0.17), ESR1 (-0.15), TBC1D9 (-0.13), FOXA1 (-0.12), TFF1 (-0.12), ERBB2 (-0.11), GRB7 (-0.10), STARD3 (-0.10), PGAP3 (-0.10), TFF3 (-0.10), CXXC5 (-0.10), GATA3 (-0.10), ACOX2 (-0.09), ASPN (-0.09), MIEN1 (-0.08), SPDEF (-0.08), CHAD (-0.08), EEF1A2 (-0.08), CMBL (-0.08), SRARP (-0.07) |
Results of the three Ensemble models for the original TNBC data and the subsets with outliers removed.
| Dataset | EN: Model size∗∗ | EN: MR# | SPLS-DA: Model size | SPLS-DA: MR | SGPLS: Model size | SGPLS: MR |
|---|---|---|---|---|---|---|
| Original data | 175 | 0.012 | 22 | 0.064 | 33 | 0.059 |
| Subset∗ | 83 | 0.000 | 87 | 0.008 | 16 | 0.015 |
| Subset## | 49 | 0.001 | 38 | 0.014 | 55 | 0.013 |
∗This subset is the original dataset after removing 68 outliers identified by enetLTS. ##This subset is the original dataset after removing 47 outliers identified by MTL-EN. ∗∗Model size: number of variables; #MR: misclassification rate.
Genes selected by Ensemble for the TNBC subset∗.
| FOXC1, ESR1, AGR2, FOXA1, TFF3, TFF1, KLK6, AGR3, FDCSP, KRT6B, KRT16, PPP1R14C |
∗This subset is the original dataset after removing 47 outliers identified by MTL-EN.