Heloisa H Milioli, Renato Vimieiro, Inna Tishchenko, Carlos Riveros, Regina Berretta, Pablo Moscato.
Abstract
BACKGROUND: Multi-gene lists and single sample predictor models are currently used to reduce the multidimensional complexity of breast cancers and to identify intrinsic subtypes. The perceived inability of some models to cope with high-dimensional data, however, limits the accurate characterisation of these subtypes. To develop more robust strategies, we designed an iterative approach to consistently discriminate intrinsic subtypes and improve class prediction in the METABRIC dataset.
Keywords: Breast cancer; CM1 score; Classifiers; Data mining; Ensemble learning; Feature selection; Intrinsic subtypes; METABRIC; Predictor models; Subtype prediction
Year: 2016 PMID: 26770261 PMCID: PMC4712506 DOI: 10.1186/s13040-015-0078-9
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Refinement process. The process is initialised with labels assigned using the PAM50 method. After computing the CM1 score, the top 10 highly discriminative probes are selected for each subtype. This set of features is used to train 24 distinct classifiers under 10-fold cross-validation. Samples are relabelled (possibly with the same label) if the classifiers agree in at least 50 % of the cases; otherwise they are marked as inconsistent and excluded from further iterations. The stopping criterion is reached when the sample labels and the selected feature set no longer change, or when the desired Fleiss' kappa is achieved. After stopping, the final feature set and sample labels are used to classify the samples previously marked as inconsistent and those from the validation dataset. These samples are run through the same refinement procedure: inconsistent samples are reclassified and labels are refined.
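The relabelling rule and the Fleiss' kappa stopping criterion described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `consensus_label`, `fleiss_kappa` and `INCONSISTENT` are hypothetical, the 50 % rule is read here as "the most-voted label is backed by at least half of the classifiers", and a tie for first place is treated as no consensus, which the caption does not specify.

```python
from collections import Counter

INCONSISTENT = "inconsistent"  # hypothetical marker for samples without consensus


def consensus_label(votes, min_agreement=0.5):
    """Return the label backed by at least `min_agreement` of the votes.

    A tie for first place is treated as no consensus (an assumption; the
    caption does not say how ties are resolved).
    """
    ranked = Counter(votes).most_common(2)
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if not tied and top_count / len(votes) >= min_agreement:
        return top_label
    return INCONSISTENT


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-sample category counts.

    `counts[i][j]` is the number of classifiers that assigned sample i to
    category j; every row must sum to the same number of classifiers.
    """
    n_samples = len(counts)
    n_raters = sum(counts[0])
    # Mean observed agreement across samples.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_samples
    # Agreement expected by chance from the overall category proportions.
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_samples * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        return 1.0  # degenerate case: every vote fell in one category
    return (p_bar - p_e) / (1 - p_e)


# Toy example with 24 classifier votes per sample (made-up data).
votes = {
    "s1": ["LumA"] * 20 + ["LumB"] * 4,
    "s2": ["LumA"] * 10 + ["LumB"] * 8 + ["HER2"] * 6,
}
refined = {sample: consensus_label(v) for sample, v in votes.items()}
print(refined)
```

In this toy run the first sample keeps a label because 20 of 24 votes agree, while the second is marked inconsistent because its most-voted label has only 10 of 24 votes.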
Fig. 2 The heat map of refined intrinsic features selected using the CM1 score in the refinement process. The heat map shows 35 probes (rows) and 1992 samples (columns) from the discovery and validation sets, ordered according to gene expression similarity. For visualisation, the expression values are normalised across the probes using a two-sided threshold of 1 % (for under- and over-expression). The bars at the bottom show the sample distribution according to the refined and original labels assigned to the METABRIC cohort. The subtypes are defined as follows: luminal A (blue), luminal B (green), HER2-enriched (yellow), normal-like (purple), basal-like (red), and inconsistent (grey).
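The caption does not spell out how the two-sided 1 % threshold is applied. One possible reading, sketched below in Python, is that each probe's values are clipped at its 1st and 99th percentiles before plotting so that extreme under- and over-expression does not dominate the colour scale. The function name `clip_two_sided` is hypothetical, and the percentile interpretation is an assumption rather than the authors' stated procedure.

```python
from statistics import quantiles


def clip_two_sided(values, percent=1):
    """Clip values below the `percent` and above the (100 - percent) percentile.

    Assumed reading of the "two-sided threshold of 1 %"; `percent` must be
    an integer between 1 and 49.
    """
    cuts = quantiles(values, n=100)  # 99 cut points: 1st to 99th percentile
    low, high = cuts[percent - 1], cuts[100 - percent - 1]
    return [min(max(v, low), high) for v in values]


# Made-up expression values for one probe, with a single extreme outlier.
probe = [float(v) for v in range(100)] + [1000.0]
clipped = clip_two_sided(probe)
print(min(clipped), max(clipped))
```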
Contingency table for predicted labels vs. initial subtypes (rows and columns, respectively)
| Subtypes | Lum A | Lum B | HER2 | Basal | Normal | Summary |
|---|---|---|---|---|---|---|
| Lum A | 563 | 94 | 11 | 2 | 58 | 728 |
| Lum B | 102 | 383 | 77 | 19 | 19 | 600 |
| HER2 | 7 | 1 | 149 | 59 | 18 | 234 |
| Basal | 0 | 0 | 0 | 230 | 3 | 233 |
| Normal | 33 | 0 | 1 | 15 | 95 | 144 |
| Inconsistent | 16 | 14 | 2 | 6 | 9 | 47 |
| Summary | 721 | 492 | 240 | 331 | 202 | 1986 |
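As a worked check of the contingency table, the short Python sketch below recomputes the margins and counts how many samples kept their initial label. The variable names are illustrative; the numbers are copied from the table, with rows as predicted (refined) labels and columns as initial labels.

```python
subtypes = ["Lum A", "Lum B", "HER2", "Basal", "Normal"]

# Rows: predicted label; columns: initial label, in `subtypes` order.
table = {
    "Lum A": [563, 94, 11, 2, 58],
    "Lum B": [102, 383, 77, 19, 19],
    "HER2": [7, 1, 149, 59, 18],
    "Basal": [0, 0, 0, 230, 3],
    "Normal": [33, 0, 1, 15, 95],
    "Inconsistent": [16, 14, 2, 6, 9],
}

row_totals = {name: sum(row) for name, row in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
total = sum(row_totals.values())

# Diagonal entries: samples whose predicted label equals the initial one.
unchanged = sum(table[name][i] for i, name in enumerate(subtypes))
inconsistent = row_totals["Inconsistent"]
reassigned = total - unchanged - inconsistent

print(f"unchanged:    {unchanged} ({unchanged / total:.1%})")
print(f"reassigned:   {reassigned}")
print(f"inconsistent: {inconsistent}")
```

The recomputed margins match the Summary row and column, and the diagonal shows that 1420 of the 1986 samples kept their initial label.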
Number of samples for each clinical marker in the PAM50 subtypes and refined labels
PAM50 subtypes
| Class ∖ Marker | PR+ | PR- | ER+ | ER- | HER2+ | HER2- |
|---|---|---|---|---|---|---|
| Luminal A | 550 | 171 | 717 | 4 | 23 | 698 |
| Luminal B | 309 | 183 | 492 | 0 | 45 | 447 |
| Her2-enriched | 51 | 189 | 98 | 142 | 135 | 105 |
| Basal-like | 29 | 302 | 41 | 290 | 30 | 301 |
| Normal-like | 106 | 96 | 164 | 38 | 16 | 186 |
Refined labels
| Class ∖ Marker | PR+ | PR- | ER+ | ER- | HER2+ | HER2- |
|---|---|---|---|---|---|---|
| Luminal A | 558 | 170 | 726 | 2 | 14 | 714 |
| Luminal B | 358 | 242 | 599 | 1 | 83 | 517 |
| Her2-enriched | 11 | 223 | 19 | 215 | 139 | 95 |
| Basal-like | 7 | 226 | 9 | 224 | 4 | 229 |
| Normal-like | 85 | 59 | 115 | 29 | 4 | 140 |
| Inconsistent | 26 | 21 | 44 | 3 | 5 | 42 |
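The marker counts can be cross-checked against the contingency table with a small Python sketch. It assumes that the first block of rows holds the PAM50 labels and the second the refined labels, an inference from the class sizes (721 and 728 luminal A samples, respectively) rather than something the extracted table states. The helper names `class_sizes` and `positive_share` are hypothetical.

```python
MARKERS = ("PR", "ER", "HER2")

# Column order in each row: PR+, PR-, ER+, ER-, HER2+, HER2-.
pam50 = {
    "Luminal A": [550, 171, 717, 4, 23, 698],
    "Luminal B": [309, 183, 492, 0, 45, 447],
    "Her2-enriched": [51, 189, 98, 142, 135, 105],
    "Basal-like": [29, 302, 41, 290, 30, 301],
    "Normal-like": [106, 96, 164, 38, 16, 186],
}
refined = {
    "Luminal A": [558, 170, 726, 2, 14, 714],
    "Luminal B": [358, 242, 599, 1, 83, 517],
    "Her2-enriched": [11, 223, 19, 215, 139, 95],
    "Basal-like": [7, 226, 9, 224, 4, 229],
    "Normal-like": [85, 59, 115, 29, 4, 140],
    "Inconsistent": [26, 21, 44, 3, 5, 42],
}


def class_sizes(block):
    """Return the sample count per class, checking each +/- pair agrees."""
    sizes = {}
    for name, row in block.items():
        pair_sums = {row[i] + row[i + 1] for i in range(0, 6, 2)}
        assert len(pair_sums) == 1, f"inconsistent totals for {name}"
        sizes[name] = pair_sums.pop()
    return sizes


def positive_share(block, name, marker):
    """Fraction of samples in class `name` that are positive for `marker`."""
    i = 2 * MARKERS.index(marker)
    pos, neg = block[name][i], block[name][i + 1]
    return pos / (pos + neg)


print(class_sizes(pam50))
print(class_sizes(refined))
for label, block in (("PAM50", pam50), ("refined", refined)):
    er_neg = 1 - positive_share(block, "Basal-like", "ER")
    print(f"{label}: ER- share among basal-like = {er_neg:.3f}")
```

Under that assumption, the share of ER- samples among basal-like tumours rises from about 0.876 with the PAM50 labels to about 0.961 with the refined labels.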
Fig. 3 The survival curves for original and refined labels in the METABRIC discovery and validation sets. Each curve represents the survival probability at a given time after diagnosis. Drops in the curve indicate deaths. The probabilities of the last ten observations are plotted as dashed lines.
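Step-shaped survival curves of this kind are commonly produced with the Kaplan-Meier estimator, although the caption does not name the method used. The Python sketch below shows that estimator on made-up follow-up data; the function name `kaplan_meier` and the input values are illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    `times` are follow-up times; `events` are 1 for an observed death and
    0 for a censored observation. Returns (time, survival) pairs at each
    time where at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all observations that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # The curve drops only at times with observed deaths.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up data: four patients, the second one censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Censored patients leave the risk set without producing a drop, which is why the curve only steps down at observed deaths.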