Pieter Segaert1, Marta B Lopes2, Sandra Casimiro3, Susana Vinga2,4, Peter J Rousseeuw1.
Abstract
Correct classification of breast cancer subtypes is of high importance because it directly affects the therapeutic options. We focus on triple-negative breast cancer, which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma transcriptomic data publicly available from The Cancer Genome Atlas data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. We also illustrate that classical statistical methods may fail to identify outliers because of the heavy influence these outliers exert on the fit, which motivates the use of robust statistics. Using robust sparse logistic regression we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to triple-negative breast cancer, reinforcing the validity of the method. The remaining 14 genes are new potential biomarkers for triple-negative breast cancer. Of these, JAM3, SFT2D2, and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between triple-negative breast cancer and non-triple-negative breast cancer data. The individual role of FOXA1 in triple-negative and non-triple-negative breast cancer, and the strong FOXA1-AGR2 connection in triple-negative breast cancer, stand out. The goal of our paper is to contribute to the understanding and management of breast cancer, and of triple-negative breast cancer in particular. At the same time, it demonstrates that robust regression and outlier detection are key strategies for coping with high-dimensional clinical data such as omics data.
Keywords: Logistic regression; cellwise outliers; gene networks; sparsity
Year: 2018 PMID: 30146936 PMCID: PMC6745616 DOI: 10.1177/0962280218794722
Source DB: PubMed Journal: Stat Methods Med Res ISSN: 0962-2802 Impact factor: 3.021
Figure 1. Illustration of the cellwise outlier paradigm versus the typical outlier paradigm.
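To make the cellwise idea concrete, the following is a minimal sketch that flags individual cells of a data matrix rather than whole rows. It is a much simplified, univariate stand-in based on robust z-scores (median and MAD per column); it does not reproduce DetectDeviatingCells. The function name `flag_cells`, the cutoff value, and the toy data are assumptions made for illustration.

```python
from statistics import median

def flag_cells(rows, cutoff=2.576):
    """Flag (row, column) cells whose robust z-score within their column is large.

    Assumption: each column is centered by its median and scaled by
    1.4826 * MAD; the cutoff is a hypothetical choice, not taken from the paper.
    """
    flagged = []
    n_cols = len(rows[0])
    for j in range(n_cols):
        col = [r[j] for r in rows]
        center = median(col)
        mad = median([abs(v - center) for v in col])
        scale = 1.4826 * mad
        if scale == 0:
            continue  # constant column: no robust scale available
        for i, v in enumerate(col):
            if abs(v - center) / scale > cutoff:
                flagged.append((i, j))
    return flagged

# Toy matrix: 5 individuals (rows) by 2 variables (columns), made-up values.
rows = [
    [1.0, 5.0],
    [1.1, 5.2],
    [0.9, 4.9],
    [1.0, 5.1],
    [8.0, 5.0],  # only the first cell of this row deviates
]
flagged_cells = flag_cells(rows)
```

In this toy example only one cell of the last row is flagged, while its second cell remains usable; a rowwise approach would instead have to keep or discard the whole row.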
Summary of the fitted models for the robust and non-robust sparse logistic regression methods.
| | Sparse logistic regression | Robust sparse logistic regression |
|---|---|---|
| | 1.00 | 0.81 |
| | 0.005 | 0.057 |
| Selected genes | 136 | 36 |
| Potential outliers | 0 | 43 |
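One common idea behind trimmed robust fitting is to rank observations by how poorly the fitted model explains them and to set aside the worst-fitting share. The sketch below illustrates only that ranking step with the binomial deviance; it is not the authors' procedure. The predicted probabilities are taken as given, and `flag_outliers`, `keep_fraction`, and the toy values are hypothetical.

```python
import math

def deviance(y, p, eps=1e-12):
    """Binomial deviance contribution of one observation (y is 0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -2.0 * (y * math.log(p) + (1 - y) * math.log(1.0 - p))

def flag_outliers(labels, probs, keep_fraction=0.75):
    """Return sorted indices of the observations with the largest deviances.

    keep_fraction is a hypothetical tuning value: the share of observations
    treated as the clean subset; the remainder is flagged.
    """
    devs = [deviance(y, p) for y, p in zip(labels, probs)]
    n_keep = math.ceil(keep_fraction * len(devs))
    order = sorted(range(len(devs)), key=lambda i: devs[i])
    return sorted(order[n_keep:])

# Made-up labels and predicted probabilities of class 1.
labels = [1, 1, 0, 0, 1, 0, 0, 1]
probs = [0.9, 0.8, 0.1, 0.2, 0.05, 0.3, 0.95, 0.7]
flagged = flag_outliers(labels, probs)
```

In a full robust procedure the probabilities would come from a model fitted on the clean subset only, and fitting and trimming would be repeated. With a classical fit, influential outliers can pull the model toward themselves and end up with small deviances, which is consistent with the 0 potential outliers reported for the non-robust model.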
Summary of the 43 individuals identified as outliers by robust sparse logistic regression regarding ER, PR, and HER2 gene expression and corresponding clinical label (within parentheses).
| # | Individual | ER FPKM (clinical) | PR FPKM (clinical) | HER2 FPKM | HER2 (clinical) | Clinical type |
|---|---|---|---|---|---|---|
| 1 | TCGA-AR-A1AO | 1.47(+) | 1.13(−) | 14.89 | (−) (−) | non-TNBC |
| 2 | TCGA-BH-A6R9 | 0.59(−) | 0.25(+) | 8.18 | (−) | non-TNBC |
| 3 | TCGA-AC-A62X | 0.19(+) | 0.02(−) | 28.53 | | non-TNBC |
| 4 | TCGA-A2-A0YJ | 0.09(+) | 0.03(−) | 240.24 | (−) (−) | non-TNBC |
| 5 | | | | | (−) (−) ( | |
| 6 | TCGA-A7-A13D | 0.52(−) | 0.81(+) | 42.28 | (Ind) (Equiv) (−) | non-TNBC |
| 7 | TCGA-E2-A1II | 0.14(−) | 0.19(+) | 10.73 | (−) (−) | non-TNBC |
| 8 | TCGA-AR-A1AH | 0.03(+) | 0.03(−) | 34.12 | (−) | non-TNBC |
| 9 | TCGA-BH-A0DL | 6.99(+) | 0.04(−) | 9.92 | (−) | non-TNBC |
| 10 | TCGA-E2-A14Y | 0.67(+) | 0.03(+) | 487.90 | (Ind) (Equiv) (+) | non-TNBC |
| 11 | | | | | (−) (−) ( | |
| 12 | | | | | (−) ( | |
| 13 | TCGA-AO-A1KO | 10.78(+) | 9.12(+) | 14.91 | (−) (−) | non-TNBC |
| 14 | | | | | (−) ( | |
| 15 | TCGA-A1-A0SB | 3.16(+) | 0.03(−) | 32.35 | (−) | non-TNBC |
| 16 | TCGA-D8-A1JM | 5.01(+) | 0.01(−) | 21.85 | (−) (−) | non-TNBC |
| 17 | TCGA-E9-A1NC | 0.11(−) | 0.08(+) | 15.91 | (+) | non-TNBC |
| 18 | TCGA-A2-A25F | 0.62(−) | 0.23(+) | 5.19 | (−) | non-TNBC |
| 19 | TCGA-A2-A1G1 | 0.53(−) | 0.17(−) | 819.76 | (Ind) (Equiv) (+) | non-TNBC |
| 20 | TCGA-LL-A6FR | 0.33(−) | 0.04(+) | 32.13 | (Ind) (Equiv) (+) | non-TNBC |
| 21 | TCGA-A2-A3Y0 | 2.18(+) | 0.03(−) | 11.34 | (−) (−) | non-TNBC |
| 22 | TCGA-B6-A0IJ | 1.18(+) | 0.46(+) | 11.12 | | non-TNBC |
| 23 | TCGA-AR-A0TP | 0.04(+) | 0.03(−) | 13.39 | (−) | non-TNBC |
| 24 | TCGA-S3-AA0Z | 16.67(+) | 0.07(+) | 33.07 | (−) (Equiv) (−) | non-TNBC |
| 25 | TCGA-A2-A4S1 | 0.29(+) | 0.01(−) | 0.61 | (−) | non-TNBC |
| 26 | TCGA-A7-A13E | 0.82(+) | 0.06(−) | 46.08 | (Ind) (Equiv) (−) | non-TNBC |
| 27 | TCGA-D8-A1JK | 0.40(−) | 0.72(+) | 22.19 | (−) (−) | non-TNBC |
| 28 | TCGA-E9-A1ND | 1.44(−) | 0.05(−) | 13.05 | (+) | non-TNBC |
| 29 | | | | | (−) | |
| 30 | | | | | (−) ( | |
| 31 | TCGA-D8-A1XW | 0.32(−) | 0.11(+) | 21.03 | (−) (−) | non-TNBC |
| 32 | TCGA-UU-A93S | 0.30(−) | 0.12(−) | 1668.35 | (+) (+) | non-TNBC |
| 33 | TCGA-OL-A5S0 | 0.09(+) | 0.06(−) | 31.92 | (+) | non-TNBC |
| 34 | TCGA-E9-A22G | 0.44(−) | 0.02(−) | 15.32 | (+) | non-TNBC |
| 35 | TCGA-AR-A24Q | 1.00(+) | 0.36(−) | 20.67 | (−) | non-TNBC |
| 36 | TCGA-E2-A1B0 | 0.14(−) | 0.26(−) | 563.81 | (+) (+) | non-TNBC |
| 37 | TCGA-AR-A251 | 1.57(+) | 0.10(−) | 14.02 | (Ind) (Equiv) (−) | non-TNBC |
| 38 | TCGA-A2-A4RX | 0.68(+) | 0.93(+) | 26.64 | (−) (−) | non-TNBC |
| 39 | TCGA-AR-A1AJ | 1.47(+) | 0.07(−) | 9.74 | (−) | non-TNBC |
| 40 | | | | | (−) (−) ( | |
| 41 | TCGA-BH-A5IZ | 5.12(+) | 0.03(−) | 28.08 | (−) (−) | non-TNBC |
| 42 | TCGA-D8-A13Y | 15.48(+) | 4.17(+) | 4.83 | (−) (−) | non-TNBC |
| 43 | TCGA-LL-A8F5 | 1.08(+) | 0.04(−) | 11.86 | (−) (−) | non-TNBC |
Note: Individuals highlighted in bold correspond to individuals previously identified as suspicious as described in the Data description section. FPKM: fragments per kilobase million; Ind: indeterminate; Equiv: equivocal; ER: estrogen receptor; PR: progesterone receptor; HER2: human epidermal growth factor receptor 2.
Figure 2. Cellwise outlier map. The columns correspond to the genes selected by the robust sparse logistic model. The rows correspond to 30 non-TNBC patients (label nT), 30 TNBC patients (label T), and the 43 outliers found by the robust fit.
Figure 3. Interpretation of genes selected in the robust sparse logistic model. The color coding corresponds to the color determined by the DDC map.
Genes selected by the robust sparse logistic method, corresponding coefficients (rounded to 3 digits) and their color coding.
| # | Gene | Coef | Color | # | Gene | Coef | Color |
|---|---|---|---|---|---|---|---|
| 0 | Intercept | 0.225 | None | 19 | | 0.024 | Red |
| 1 | | 0.368 | Yellow | 20 | | 0.021 | Yellow |
| 2 | | 0.314 | Red | 21 | | 0.020 | Yellow |
| 3 | | 0.297 | Red | 22 | | 0.018 | Yellow |
| 4 | | 0.260 | Red | 23 | | 0.008 | Yellow |
| 5 | | 0.252 | Red | 24 | | −0.021 | Blue |
| 6 | | 0.215 | Red | 25 | | −0.030 | Yellow |
| 7 | | 0.134 | Yellow | 26 | | −0.044 | Yellow |
| 8 | | 0.125 | Red | 27 | | −0.050 | Yellow |
| 9 | | 0.094 | Red | 28 | | −0.051 | Blue |
| 10 | | 0.088 | Red | 29 | | −0.053 | Yellow |
| 11 | | 0.083 | Red | 30 | | −0.072 | Blue |
| 12 | | 0.077 | Red | 31 | | −0.080 | Yellow |
| 13 | | 0.061 | Red | 32 | | −0.097 | Blue |
| 14 | | 0.048 | Yellow | 33 | | −0.246 | Yellow |
| 15 | | 0.041 | Yellow | 34 | | −0.338 | Blue |
| 16 | | 0.034 | Red | 35 | | −0.441 | Blue |
| 17 | | 0.031 | Yellow | 36 | | −0.551 | Yellow |
| 18 | | 0.025 | Yellow | | | | |
Note: The genes are sorted by their coefficient.
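As a reading aid for the coefficient table, the sketch below shows how an intercept and coefficients of a logistic model turn into a predicted probability. It assumes the standard logistic link and expression values on the same scale as used for fitting; which class is coded as 1 is not stated in this record. Only the intercept and two coefficients from the table are used, so this is a truncated illustration of the 36-gene model, and the expression values are hypothetical.

```python
import math

def predict_probability(intercept, coefs, x):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

intercept = 0.225        # intercept from the table
coefs = [0.368, -0.551]  # largest and smallest coefficients from the table

x_zero = [0.0, 0.0]      # all predictors at zero: only the intercept acts
x_example = [1.0, -1.0]  # hypothetical expression values, for illustration

p_zero = predict_probability(intercept, coefs, x_zero)
p_example = predict_probability(intercept, coefs, x_example)
```

Raising a gene with a positive coefficient, or lowering one with a negative coefficient, increases the linear predictor and therefore the predicted probability, which is why `p_example` exceeds `p_zero` here.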
Figure 4. Representation of the correlation between the genes selected by the robust sparse logistic model. The color coding corresponds to the color determined by the DDC map. (a) Correlations for non-TNBC patients. (b) Correlations for TNBC patients.
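A group-wise comparison of this kind can be sketched by computing the correlation between two genes separately in each patient group and looking at the difference. The sketch uses the plain Pearson correlation as a stand-in; the record does not say which correlation measure underlies the figure. The gene placeholders `g1` and `g2` and all expression values are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical expression values of two placeholder genes in each group.
tnbc = {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [2.0, 4.0, 6.0, 8.0]}
non_tnbc = {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [3.0, 1.0, 4.0, 2.0]}

r_tnbc = pearson(tnbc["g1"], tnbc["g2"])
r_non_tnbc = pearson(non_tnbc["g1"], non_tnbc["g2"])
difference = r_tnbc - r_non_tnbc
```

Repeating this for every pair of selected genes gives one correlation matrix per group, and entries that differ strongly between the two matrices point to gene pairs whose relationship changes between groups.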