Heloisa Helena Milioli, Renato Vimieiro, Carlos Riveros, Inna Tishchenko, Regina Berretta, Pablo Moscato.
Abstract
BACKGROUND: The prediction of breast cancer intrinsic subtypes has been introduced as a valuable strategy to determine patient diagnosis, prognosis, and therapy response. The PAM50 method, based on the expression levels of 50 genes, uses a single sample predictor model to assign subtype labels to samples. Intrinsic errors reported within this assay demonstrate the challenge of identifying and understanding the breast cancer groups. In this study, we aim to: a) identify novel biomarkers for subtype individuation by exploring the competence of a newly proposed method named CM1 score, and b) apply ensemble learning, as opposed to a single classifier, for sample subtype assignment. The overarching objective is to improve class prediction.
Year: 2015 PMID: 26132585 PMCID: PMC4488510 DOI: 10.1371/journal.pone.0129711
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The step-by-step process.
The image shows the method steps based on the CM1 score and ensemble learning. The METABRIC discovery set is used to compute the CM1 score, based on the original labels previously assigned with the PAM50 method. The output of this step is a set of 42 discriminative probes, the CM1 list. The following step involves sample subtype classification based on 10-fold cross-validation. Samples in the METABRIC discovery set are used to train 24 classifiers using the CM1 list and, alternatively, the PAM50 list. The samples are partitioned into ten folds; a model is then built using 90% of the samples and used to predict the labels of the remaining 10%. After all ten folds have been held out in turn, the level of association between the predicted and original METABRIC labels is computed using several statistics. In the training-test setting, labels of samples in the METABRIC validation set and the ROCK set are predicted with the models built on the discovery set. The same statistics are computed again to assess how well the models predict breast cancer subtypes. In both classification steps, the new labels are attributed based on the consensus of the majority of the classifiers. Finally, the new labels are compared against the clinical data, the current markers ER, PR and HER2, and survival curves.
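The 10-fold cross-validation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the nearest-centroid classifier is a toy stand-in for any one of the 24 classifiers, and the function names, the seed, and the toy data are all assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(rows, labels):
    """Toy stand-in classifier: store the mean feature vector of each class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in grouped.items()
    }


def predict_centroid(model, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(model[label], row)),
    )


def cross_validate(rows, labels, k=10, seed=0):
    """Return one held-out prediction per sample (train on k-1 folds, predict 1)."""
    predicted = [None] * len(rows)
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = fit_centroids([rows[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            predicted[i] = predict_centroid(model, rows[i])
    return predicted


# Hypothetical toy data: two well-separated classes, one feature each.
rows = [[0.01 * i] for i in range(20)] + [[10 + 0.01 * i] for i in range(20)]
labels = ["A"] * 20 + ["B"] * 20
predictions = cross_validate(rows, labels)
```

Every sample is predicted exactly once by a model that never saw it during training, which is what allows the predicted labels to be compared with the original ones afterwards.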
CM1 list.
| Probe ID | Gene name | Gene symbol |
|---|---|---|
| ILMN_1684217 | Aurora kinase B | AURKB |
| ILMN_1683450 | Cell division cycle associated 5 | CDCA5 |
| ILMN_1747016 | Centrosomal protein 55kDa | CEP55 |
| ILMN_2212909 | Maternal embryonic leucine zipper kinase | MELK |
| ILMN_1714730 | Ubiquitin-conjugating enzyme E2C | UBE2C |
| ILMN_1796059 | Ankyrin repeat domain 30A | ANKRD30A |
| ILMN_1651329 | Long intergenic non-protein coding RNA 993 | LOC646360 |
| ILMN_2310814 | Microtubule-associated protein tau | MAPT |
| ILMN_1728787 | Anterior gradient 3 | AGR3 |
| ILMN_1688071 | N-acetyltransferase 1 | NAT1 |
| ILMN_1729216 | Crystallin, alpha B | CRYAB |
| ILMN_1666845 | Keratin 17 | KRT17 |
| ILMN_1786720 | Prominin 1 | PROM1 |
| ILMN_1753101 | V-set domain containing T cell activation inhibitor 1 | VTCN1 |
| ILMN_1798108 | Chromosome 6 open reading frame 211 | C6orf211 |
| ILMN_1747911 | Cyclin-dependent kinase 1 | CDC2 |
| ILMN_1666305 | Cyclin-dependent kinase inhibitor 3 | CDKN3 |
| ILMN_1678535 | Estrogen receptor 1 | ESR1 |
| ILMN_2149164 | Secreted frizzled-related protein 1 | SFRP1 |
| ILMN_1788874 | Serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | SERPINA3 |
| ILMN_1785570 | Sushi domain containing 3 | SUSD3 |
| ILMN_1803236 | Chloride channel accessory 2 | CLCA2 |
| ILMN_2161820 | Glycine-N-acyltransferase-like 2 | GLYATL2 |
| ILMN_1810978 | Mucin-like 1 | MUCL1 |
| ILMN_1773459 | SRY (sex determining region Y)-box 11 | SOX11 |
| ILMN_1674533 | Transient receptor potential cation channel, subfamily V, member 6 | TRPV6 |
| ILMN_1687235, ILMN_2358760 | Hepsin | HPN |
| ILMN_1655915 | Matrix metallopeptidase 11 (stromelysin 3) | MMP11 |
| ILMN_1711470 | Ubiquitin-conjugating enzyme E2T (putative) | UBE2T |
| ILMN_1740609 | Chemokine (C-C motif) ligand 15 | CCL15 |
| ILMN_1789507 | Collagen, type XI, alpha 1 | COL11A1 |
| ILMN_1651282 | Collagen, type XVII, alpha 1 | COL17A1 |
| ILMN_1723684 | Duffy blood group, atypical chemokine receptor | DARC |
| ILMN_1809099 | Interleukin 33 | IL33 |
| ILMN_1766650 | Forkhead box A1 | FOXA1 |
| ILMN_1811387 | Trefoil factor 3 (intestinal) | TFF3 |
| ILMN_1738401 | Forkhead box C1 | FOXC1 |
| ILMN_1689146 | Gamma-aminobutyric acid (GABA) A receptor, pi | GABRP |
| ILMN_1807423 | Insulin-like growth factor 2 mRNA binding protein 3 | IGF2BP3 |
| ILMN_1692938 | Phosphoserine aminotransferase 1 | PSAT1 |
| ILMN_1668766 | Rhophilin associated tail protein 1 | ROPN1 |
Scores and ranks for the CM1 list.
| Probe ID | Luminal A score | Luminal A rank | Luminal B score | Luminal B rank | Her2 score | Her2 rank | Normal score | Normal rank | Basal score | Basal rank | Symbol | PAM50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ILMN_1728787 | 0.203 | 5 | 0.144 | 5 | -0.314 | 2 | | 54 | -0.461 | 3 | AGR3 | |
| ILMN_1796059 | 0.216 | 3 | | 8730 | | 1434 | | 3666 | -0.390 | 5 | ANKRD30A | |
| ILMN_1684217 | -0.203 | 1 | | 74 | | 497 | | 146 | | 97 | AURKB | |
| ILMN_1798108 | | 1980 | 0.155 | 2 | | 68 | | 405 | | 179 | C6orf211 | |
| ILMN_1740609 | | 476 | | 43 | | 970 | 0.252 | 3 | | 2776 | CCL15 | |
| ILMN_1747911 | | 80 | 0.144 | 4 | | 2080 | | 194 | | 1496 | CDC2 | |
| ILMN_1683450 | -0.196 | 3 | | 30 | | 306 | | 79 | | 166 | CDCA5 | |
| ILMN_1666305 | | 16 | 0.146 | 3 | | 438 | | 167 | | 917 | CDKN3 | |
| ILMN_1747016 | -0.195 | 5 | | 88 | | 362 | | 73 | | 127 | CEP55 | x |
| ILMN_1803236 | | 1875 | | 354 | 0.316 | 3 | | 688 | | 13483 | CLCA2 | |
| ILMN_1789507 | | 12176 | | 5363 | | 1820 | -0.155 | 3 | | 9245 | COL11A1 | |
| ILMN_1651282 | | 915 | | 16 | | 4821 | 0.244 | 4 | | 12205 | COL17A1 | |
| ILMN_1729216 | | 6657 | -0.153 | 5 | | 3008 | | 52 | | 45 | CRYAB | |
| ILMN_1723684 | | 456 | | 14 | | 2830 | 0.255 | 2 | | 4215 | DARC | |
| ILMN_1678535 | | 8 | 0.181 | 1 | -0.360 | 1 | | 7 | -0.440 | 4 | ESR1 | x |
| ILMN_1766650 | | 70 | | 85 | | 12522 | | 216 | -0.478 | 2 | FOXA1 | x |
| ILMN_1738401 | | 1047 | | 10 | | 2254 | | 226 | 0.443 | 1 | FOXC1 | x |
| ILMN_1689146 | | 1177 | | 13 | | 1833 | | 283 | 0.414 | 2 | GABRP | |
| ILMN_2161820 | | 310 | | 270 | 0.333 | 1 | | 791 | | 1479 | GLYATL2 | |
| ILMN_1687235 | | 79 | | 1942 | | 58 | -0.157 | 2 | | 211 | HPN | |
| ILMN_2358760 | | 105 | | 1941 | | 73 | -0.152 | 4 | | 284 | HPN | |
| ILMN_1807423 | | 1269 | | 2087 | | 21820 | | 11567 | 0.405 | 3 | IGF2BP3 | |
| ILMN_1809099 | | 3400 | | 141 | | 6282 | 0.275 | 1 | | 23413 | IL33 | |
| ILMN_1666845 | | 8365 | -0.186 | 2 | | 3879 | | 35 | | 29 | KRT17 | x |
| ILMN_1651329 | 0.221 | 1 | | 2481 | | 1149 | | 1159 | | 20 | LOC646360 | |
| ILMN_2310814 | 0.221 | 2 | | 8776 | | 33 | | 1131 | | 23 | MAPT | x |
| ILMN_2212909 | -0.196 | 4 | | 137 | | 501 | | 92 | | 65 | MELK | x |
| ILMN_1655915 | | 5274 | | 3486 | | 3832 | -0.166 | 1 | | 4148 | MMP11 | x |
| ILMN_1810978 | | 20520 | | 9 | 0.326 | 2 | | 6 | | 1495 | MUCL1 | |
| ILMN_1688071 | 0.215 | 4 | | 902 | -0.256 | 5 | | 24 | | 19 | NAT1 | x |
| ILMN_1786720 | | 988 | -0.174 | 3 | | 273 | | 465 | | 20 | PROM1 | |
| ILMN_1692938 | | 68 | | 343 | | 93 | | 1864 | 0.391 | 5 | PSAT1 | |
| ILMN_1668766 | | 721 | | 62 | | 1415 | | 368 | 0.405 | 4 | ROPN1 | |
| ILMN_1788874 | | 148 | | 4633 | -0.259 | 4 | | 1961 | | 1462 | SERPINA3 | |
| ILMN_2149164 | | 11497 | -0.203 | 1 | | 1697 | 0.244 | 5 | | 40 | SFRP1 | x |
| ILMN_1773459 | | 185 | | 621 | 0.293 | 5 | | 10046 | | 483 | SOX11 | |
| ILMN_1785570 | | 11 | | 2499 | -0.308 | 3 | | 438 | | 82 | SUSD3 | |
| ILMN_1811387 | | 26 | | 64 | | 1263 | | 661 | -0.521 | 1 | TFF3 | |
| ILMN_1674533 | | 643 | | 605 | 0.300 | 4 | | 2756 | | 1819 | TRPV6 | |
| ILMN_1714730 | -0.200 | 2 | | 9 | | 318 | | 43 | | 353 | UBE2C | x |
| ILMN_1711470 | | 56 | | 7 | | 1732 | -0.145 | 5 | | 1113 | UBE2T | x |
| ILMN_1753101 | | 474 | -0.153 | 4 | | 2424 | | 3373 | | 1522 | VTCN1 | |
The CM1 scores for the topmost 5 positive and negative probe IDs in each subtype are given. The ranks correspond to the position of the probe from the topmost positive or negative (with 1 being the top ranked score at either side). The rightmost two columns indicate the gene symbol the probe maps to, and which genes appear also in the PAM50 list.
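The selection rule described here, keeping the five most positive and five most negative probes per subtype and pooling them, can be sketched as below. The CM1 scores themselves are treated as given inputs; the function names are hypothetical, and gene symbols are used as keys instead of probe IDs purely for readability. The score values are the Luminal A and Her2 entries from the table.

```python
def balanced_top_probes(scores, k=5):
    """Return (top-k positive, top-k negative) keys from a {probe: score} dict."""
    ranked = sorted(scores, key=scores.get)  # ascending by score
    negatives = [p for p in ranked if scores[p] < 0][:k]
    positives = [p for p in reversed(ranked) if scores[p] > 0][:k]
    return positives, negatives


def build_probe_list(scores_by_subtype, k=5):
    """Pool the balanced selections of every subtype, dropping duplicates."""
    selected = []
    for subtype_scores in scores_by_subtype.values():
        positives, negatives = balanced_top_probes(subtype_scores, k)
        for probe in positives + negatives:
            if probe not in selected:
                selected.append(probe)
    return selected


# Scores copied from the table for two of the five subtypes.
scores_by_subtype = {
    "Luminal A": {
        "LOC646360": 0.221, "MAPT": 0.221, "ANKRD30A": 0.216, "NAT1": 0.215,
        "AGR3": 0.203, "AURKB": -0.203, "UBE2C": -0.200, "CDCA5": -0.196,
        "MELK": -0.196, "CEP55": -0.195,
    },
    "Her2": {
        "GLYATL2": 0.333, "MUCL1": 0.326, "CLCA2": 0.316, "TRPV6": 0.300,
        "SOX11": 0.293, "ESR1": -0.360, "AGR3": -0.314, "SUSD3": -0.308,
        "SERPINA3": -0.259, "NAT1": -0.256,
    },
}
probe_list = build_probe_list(scores_by_subtype)
```

Because a probe can rank highly for more than one subtype (AGR3 and NAT1 in this excerpt), the pooled list is shorter than the number of subtypes times ten.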
Fig 2. The gene expression profile of the balanced top ten probes selected for each of the five breast cancer intrinsic subtypes across 997 samples from the discovery set.
The annotated genes are defined for each subtype as an intrinsic, highly discriminative, signature. Samples were ordered according to the gene expression similarities in each breast cancer subtype. Colours represent the selected genes and sample subtypes: luminal A (yellow), luminal B (green), HER2-enriched (purple), normal-like (blue), and basal-like (red).
Fig 3. Gene expression patterns of the 42 probes selected using the CM1 score.
The heat map shows 42 probes (rows) and 997 samples (columns) from the discovery set, ordered according to gene expression similarity using a memetic algorithm [27]. The labels highlighted on top show the sample distribution according to ER positive and negative status. They also show the original PAM50 subtypes luminal A (yellow), luminal B (green), HER2-enriched (purple), normal-like (blue), and basal-like (red) in the METABRIC discovery set. Two probes in the CM1 list refer to the same gene, HPN, so the gene symbol is suffixed with the corresponding Illumina probe ID.
Fig 4. The mRNA log2 normalised expression values of 7 novel highly discriminative biomarkers across the five intrinsic subtypes in the METABRIC discovery and validation sets, and ROCK set.
The box plots show the values of 997 samples in the METABRIC discovery set, 989 in the validation set, and 1570 in the ROCK test set.
Overall performance of the ensemble learning in assigning labels to samples in the METABRIC discovery and validation sets, and ROCK test set.
| Dataset | CM1 list: CV | CM1 list: AS | PAM50 list: CV | PAM50 list: AS |
|---|---|---|---|---|
| METABRIC discovery | 0.731 ± 0.057 | 0.763 ± 0.060 | 0.752 ± 0.064 | 0.781 ± 0.070 |
| METABRIC validation | 0.632 ± 0.036 | 0.641 ± 0.039 | 0.643 ± 0.041 | 0.650 ± 0.047 |
| ROCK test set | 0.571 ± 0.060 | 0.673 ± 0.077 | 0.578 ± 0.054 | 0.687 ± 0.081 |
Values are given as average ± std. deviation. CV- Cramer’s V; AS- Average Sensitivity
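Both statistics can be computed from a contingency table of true versus predicted labels. The sketch below uses the textbook definitions (Cramér's V from the chi-squared statistic, average sensitivity as the mean per-class recall); whether the study used exactly these variants is an assumption, and the function names and toy tables are illustrative only.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    # Only non-empty rows and columns count towards the degrees of freedom.
    n_rows = sum(1 for t in row_totals if t > 0)
    n_cols = sum(1 for t in col_totals if t > 0)
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


def average_sensitivity(table):
    """Mean per-class recall; rows are true classes, column i matches row i."""
    recalls = [row[i] / sum(row) for i, row in enumerate(table) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Hypothetical two-class tables (rows: true class, columns: predicted class).
perfect = [[5, 0], [0, 5]]
independent = [[5, 5], [5, 5]]
mixed = [[8, 2], [1, 9]]
```

Cramér's V ranges from 0 (no association) to 1 (perfect association), while average sensitivity weights every class equally regardless of its size, which matters when subtypes are as unbalanced as they are here.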
Contingency tables for predicted labels using the 24 classifiers trained with the CM1 list.
| | METABRIC discovery | | | | | | METABRIC validation | | | | | | ROCK test set | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original label | LA | LB | H | N | B | I | LA | LB | H | N | B | I | LA | LB | H | N | B | I |
| LA | 435 | 19 | 2 | 2 | 0 | 8 | 252 | 2 | 0 | 0 | 0 | 1 | 452 | 122 | 2 | 0 | 0 | 17 |
| LB | 24 | 234 | 0 | 0 | 0 | 10 | 62 | 156 | 0 | 0 | 0 | 6 | 18 | 371 | 42 | 0 | 2 | 14 |
| H | 4 | 4 | 67 | 0 | 2 | 10 | 23 | 45 | 71 | 2 | 2 | 10 | 0 | 1 | 13 | 0 | 0 | 0 |
| N | 13 | 0 | 8 | 31 | 0 | 6 | 80 | 0 | 0 | 59 | 0 | 5 | 115 | 8 | 36 | 74 | 56 | 50 |
| B | 0 | 0 | 10 | 2 | 103 | 3 | 6 | 7 | 22 | 19 | 142 | 17 | 0 | 0 | 0 | 7 | 166 | 4 |

Rows contain the original labels assigned using the PAM50 method, while columns contain the labels assigned by the majority of classifiers trained with the CM1 list. In this table, LA corresponds to luminal A, LB to luminal B, H to HER2-enriched, N to normal-like, and B to basal-like. Labels marked as I refer to inconsistent assignments: situations where no subtype label was attributed by a majority of the classifiers.
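The consensus labelling and the resulting table can be sketched as follows. This assumes "majority" means strictly more than half of the classifiers, which the text does not state explicitly; the function names and the toy votes are hypothetical.

```python
from collections import Counter


def majority_label(votes, inconsistent="I"):
    """Return the label chosen by more than half of the voters, else 'I'."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else inconsistent


def contingency(original, consensus, row_labels, col_labels):
    """Count samples per (original label, consensus label) pair."""
    table = [[0] * len(col_labels) for _ in row_labels]
    for orig, cons in zip(original, consensus):
        table[row_labels.index(orig)][col_labels.index(cons)] += 1
    return table


# Hypothetical votes from 24 classifiers for three samples.
votes_per_sample = [
    ["LA"] * 13 + ["LB"] * 11,             # clear majority
    ["LA"] * 12 + ["LB"] * 12,             # tie, so no majority
    ["LA"] * 10 + ["LB"] * 8 + ["H"] * 6,  # plurality only
]
consensus = [majority_label(v) for v in votes_per_sample]
table = contingency(
    ["LA", "LB", "LA"], consensus,
    row_labels=["LA", "LB", "H", "N", "B"],
    col_labels=["LA", "LB", "H", "N", "B", "I"],
)
```

Only the consensus axis needs the extra I category, since the original labelling assigns every sample to one of the five subtypes.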
Contingency tables for predicted labels using the 24 classifiers trained with the PAM50 list.
| | METABRIC discovery | | | | | | METABRIC validation | | | | | | ROCK test set | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original label | LA | LB | H | N | B | I | LA | LB | H | N | B | I | LA | LB | H | N | B | I |
| LA | 440 | 17 | 1 | 1 | 0 | 7 | 254 | 0 | 0 | 0 | 0 | 1 | 530 | 46 | 2 | 0 | 0 | 15 |
| LB | 25 | 239 | 0 | 0 | 0 | 4 | 56 | 162 | 0 | 0 | 0 | 6 | 53 | 327 | 34 | 0 | 3 | 30 |
| H | 0 | 5 | 72 | 0 | 1 | 9 | 21 | 39 | 80 | 0 | 0 | 13 | 0 | 0 | 12 | 0 | 0 | 2 |
| N | 9 | 0 | 2 | 34 | 1 | 12 | 82 | 0 | 0 | 55 | 0 | 7 | 105 | 4 | 18 | 92 | 67 | 53 |
| B | 0 | 0 | 7 | 1 | 103 | 7 | 4 | 7 | 20 | 14 | 145 | 23 | 0 | 0 | 3 | 0 | 172 | 2 |

Rows contain the original labels assigned using the PAM50 method, while columns contain the labels assigned by the majority of classifiers trained with the PAM50 list. In this table, LA corresponds to luminal A, LB to luminal B, H to HER2-enriched, N to normal-like, and B to basal-like. Labels marked as I refer to inconsistent assignments: situations where no subtype label was attributed by a majority of the classifiers.
Contingency tables for predicted labels using the 24 classifiers trained with CM1 and PAM50 lists.
| | METABRIC discovery | | | | | | METABRIC validation | | | | | | ROCK set | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CM1 label | LA | LB | H | N | B | I | LA | LB | H | N | B | I | LA | LB | H | N | B | I |
| LA | 450 | 15 | 0 | 4 | 0 | 7 | 390 | 14 | 1 | 4 | 0 | 14 | 550 | 8 | 0 | 10 | 0 | 17 |
| LB | 20 | 235 | 0 | 0 | 0 | 2 | 12 | 185 | 8 | 0 | 0 | 5 | 112 | 361 | 0 | 0 | 0 | 29 |
| H | 0 | 0 | 75 | 2 | 1 | 9 | 0 | 1 | 83 | 0 | 1 | 8 | 0 | 4 | 67 | 0 | 8 | 21 |
| N | 0 | 0 | 0 | 28 | 0 | 7 | 6 | 0 | 0 | 61 | 1 | 12 | 0 | 0 | 0 | 67 | 0 | 7 |
| B | 0 | 0 | 2 | 0 | 101 | 2 | 0 | 0 | 1 | 0 | 140 | 3 | 0 | 0 | 0 | 2 | 219 | 3 |
| I | 4 | 11 | 5 | 2 | 3 | 12 | 9 | 8 | 7 | 4 | 3 | 8 | 26 | 4 | 2 | 13 | 15 | 25 |
Rows contain the labels assigned by the majority of classifiers trained with the CM1 list, while columns contain labels assigned by the majority of classifiers trained with PAM50 list. In this table, LA corresponds to luminal A, LB corresponds to luminal B, H to HER2-enriched, N to normal-like, and B to basal-like. Labels marked as I refer to inconsistent assignments; situations where the classifiers did not achieve the majority on attributing a subtype label.
Agreement of the 24 classifiers on assigning labels to samples in the data sets measured by Fleiss’ kappa statistic.
| Comparison | Gene list | METABRIC discovery | METABRIC validation | ROCK test set |
|---|---|---|---|---|
| Among classifiers | CM1 | 0.73 | 0.753 | 0.626 |
| Among classifiers | PAM50 | 0.724 | 0.729 | 0.59 |
| Predicted vs Original | CM1 | 0.814 | 0.596 | 0.591 |
| Predicted vs Original | PAM50 | 0.84 | 0.618 | 0.641 |
| CM1 vs PAM50 | | 0.859 | 0.832 | 0.804 |
Rows entitled Among classifiers indicate the agreement of the classifiers with one another, without reference to the original labels. Rows entitled Predicted vs Original show the agreement between the most frequently predicted label and the initial label of each sample (PAM50 method). Finally, the row entitled CM1 vs PAM50 contains the agreement between the most frequently predicted labels obtained with the CM1 and PAM50 lists using ensemble learning.
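For reference, Fleiss' kappa can be computed as in the sketch below, which follows the standard definition: every sample is labelled by the same number of raters (here, classifiers), and observed agreement is compared with the agreement expected by chance. The function name and the toy ratings are hypothetical.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for ratings[i][r] = label given to sample i by rater r."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of raters that put sample i in category j.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Overall proportion of assignments falling in each category.
    p_cat = [sum(col) / (n_items * n_raters) for col in zip(*counts)]
    # Per-sample agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings from three raters on two samples.
full_agreement = [["A", "A", "A"], ["B", "B", "B"]]
partial_agreement = [["A", "A", "B"], ["A", "B", "B"]]
```

A value of 1 means all raters agree on every sample, 0 means agreement is no better than chance, and negative values mean agreement is worse than chance. The sketch does not guard against the degenerate case in which every rating falls in a single category.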
Agreement measured by the Adjusted Rand Index between different samples’ labellings.
| Comparison | METABRIC discovery | METABRIC validation | ROCK test set |
|---|---|---|---|
| CM1-METABRIC | 0.757 | 0.426 | 0.453 |
| PAM50-METABRIC | 0.792 | 0.457 | 0.507 |
| CM1-PAM50 | 0.822 | 0.788 | 0.642 |
This table contains the agreement between the original and predicted labels of samples in the discovery and validation sets. CM1-METABRIC refers to the agreement between the labels predicted by the majority of classifiers trained with the CM1 list and the original METABRIC labels; PAM50-METABRIC is the agreement between the labels predicted by the majority of classifiers trained with the PAM50 list and the original METABRIC labels; and CM1-PAM50 is the agreement between the labels predicted using the two lists.
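The Adjusted Rand Index compares two labellings of the same samples by counting pairs of samples that are grouped together in both, corrected for chance. A minimal sketch of the standard formula follows; the function name and toy labellings are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labellings of the same samples."""
    n = len(labels_a)
    # Pairs of samples that share a group in both labellings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a group in each labelling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    return (together_both - expected) / (maximum - expected)


# Hypothetical labellings of four samples.
same_partition = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on how samples are grouped, not on the label names, so two labellings that define the same partition score 1 even if the labels differ.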
Fig 5. Class distribution in the METABRIC discovery and validation sets, and in the ROCK set.
The bars represent the number of samples in each breast cancer subtype. In the first row, the labels refer to the original assignment using the PAM50 method. The following rows show the new labels attributed using an ensemble of 24 classifiers with PAM50 and CM1 lists, respectively. Samples were classified as inconsistent if there was no consensus between the majority of classifiers as to what should be the correct subtype.
Fig 6. Similarity between subtype distributions in the METABRIC discovery and validation sets, and in the ROCK set.
The image shows the similarity between the subtype distributions for the METABRIC discovery (MD) and validation (MV) sets, and the ROCK test set (RS). The labels were assigned in the original data sets using the PAM50 method, and reassigned in this study by ensemble learning using the PAM50 and CM1 lists. The similarity is measured using the square root of the Jensen-Shannon divergence. Darker shades represent more similar distributions, while lighter shades indicate divergent patterns. The diagonal shows the darkest colour because each data set is closest to itself. According to this image, the labels assigned by ensemble learning with the CM1 and PAM50 lists are highly similar to each other, and both agree less with the original labels assigned using a single classifier (PAM), that is, the PAM50 method.
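The square root of the Jensen-Shannon divergence between two class distributions can be sketched as below. The base-2 logarithm is an assumption (it bounds the distance between 0 and 1), and the function name is hypothetical. The example counts are the two sets of marginal totals of the METABRIC discovery contingency table for the CM1 list, in the order LA, LB, H, N, B, I.

```python
import math


def js_distance(counts_p, counts_q):
    """Square root of the Jensen-Shannon divergence (log base 2) of two count vectors."""
    p = [c / sum(counts_p) for c in counts_p]
    q = [c / sum(counts_q) for c in counts_q]
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt((kl(p, mixture) + kl(q, mixture)) / 2)


original_counts = [466, 268, 87, 58, 118, 0]     # no inconsistent labels
cm1_ensemble_counts = [476, 257, 87, 35, 105, 37]
distance = js_distance(original_counts, cm1_ensemble_counts)
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and stays finite when one distribution has an empty class, which is needed here because only the ensemble labellings contain the inconsistent category.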
Fig 7. ER marker distribution across subtypes in the METABRIC data sets.
(A) Discovery and (B) Validation. The bars represent the number of samples with ER positive and negative in the five intrinsic subtypes, based on the patients’ clinical information. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.
Fig 8. PR marker distribution across subtypes in the METABRIC data sets.
(A) Discovery and (B) Validation. The bars represent the number of samples with PR positive and negative distributed in the five intrinsic subtypes, based on the patients’ clinical information. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.
Fig 9. HER2 distribution across subtypes in the METABRIC data sets.
(A) Discovery and (B) Validation. The bars represent the number of samples with HER2 amplification (positive or negative) for each intrinsic subtype based on the patients’ clinical information. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.
Fig 10. The survival curves for METABRIC discovery and validation sets.
The survival curves for each breast cancer subtype are generated using a Cox proportional hazards model based on the grade and size of the tumour, the patient's age, the number of positive lymph nodes, and ER status. Each curve represents the survival probability at a given time after diagnosis. Ticks on the curve correspond to observations of patients who are still alive, while drops indicate deaths. The probability curves based on the last 10 observations are plotted as dashed lines. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.