Silvia Cascianelli, Ivan Molineris, Claudio Isella, Marco Masseroli, Enzo Medico.
Abstract
Stratification of breast cancer (BC) into molecular subtypes by multigene expression assays is of demonstrated clinical utility. In principle, global RNA-sequencing (RNA-seq) should enable reconstructing existing transcriptional classifications of BC samples. Yet, it is not clear whether adapting to RNA-seq the classifiers originally developed using PCR or microarrays, or reconstructing them through machine learning (ML), is preferable. Hence, we focused on the robustness and portability of PAM50, a nearest-centroid classifier developed on microarray data to identify five BC "intrinsic subtypes". We found that standard PAM50 is profoundly affected by the composition of the sample cohort used for reference construction, and we propose a strategy, named AWCA, to mitigate this issue, improving classification robustness, with over 90% concordance, and prognostic ability; we also show that AWCA-based PAM50 can even be applied as a single-sample method. Furthermore, we explored five supervised learners to build robust, single-sample intrinsic subtype callers via RNA-seq. From our ML-based survey, regularized multiclass logistic regression (mLR) displayed the best performance, further increased by ad hoc gene selection on the global transcriptome. On external test sets, mLR classifications reached 90% concordance with PAM50-based calls, without the need for a reference sample; the proven robustness and prognostic ability of mLR make it an equally valuable single-sample method to strengthen BC subtyping.
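To illustrate why a nearest-centroid caller can depend on the cohort used for reference construction, here is a minimal toy sketch. It is not the PAM50 implementation and it does not implement AWCA: the four "genes", the two subtypes `A` and `B`, the centroid values, and the helper names (`gene_medians`, `classify`) are all assumptions made for illustration. It assumes the common setup in which each gene is centered on its median over a reference cohort and the sample is assigned to the centroid with the highest Spearman correlation.

```python
from statistics import median


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def gene_medians(cohort):
    """Per-gene median over a cohort (list of samples, one value per gene)."""
    return [median(column) for column in zip(*cohort)]


def classify(sample, reference, centroids):
    """Center the sample on the reference, then pick the best-correlated centroid."""
    centered = [value - ref for value, ref in zip(sample, reference)]
    return max(centroids, key=lambda name: spearman(centered, centroids[name]))


# Hypothetical centroids and expression values (not taken from the paper).
centroids = {"A": [1.0, 0.5, -0.5, -1.0], "B": [-1.0, -0.5, 0.5, 1.0]}
a_like = [7.0, 6.0, 4.0, 3.0]
b_like = [3.0, 4.0, 6.0, 7.0]

balanced_ref = gene_medians([a_like, b_like, a_like, b_like])
skewed_ref = gene_medians([a_like, a_like, a_like])  # cohort with subtype A only

mild_a = [5.5, 5.2, 4.8, 4.5]  # weakly A-like sample
call_balanced = classify(mild_a, balanced_ref, centroids)
call_skewed = classify(mild_a, skewed_ref, centroids)
```

In this toy setup the same sample is called `A` against the balanced reference and `B` against the reference built from an all-`A` cohort, which is the kind of cohort-composition effect the abstract describes.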
Year: 2020 PMID: 32826944 PMCID: PMC7442834 DOI: 10.1038/s41598-020-70832-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview. Main steps of our parallel workflows.
Figure 2. Subtyping of the TCGA dataset, varying the sample subset size of interest, for multiple runs of standard PAM50 and AWCA-based PAM50: concordances with Ciriello et al.[9] subtype calls (left); pairwise concordance distributions (right).
Figure 3. Risk of recurrence. AWCA-PAM50 calls and ROR-C scores compared with Prosigna ROR scores (top) and with PAM50 technical replicate scores (center); statistical significance in discriminating 10-year overall survival (OS) status (bottom).
Figure 4. Machine learning survey: classifiers tuned and trained on the TCGA training set with tenfold cross-validation and tested on the unseen samples of the test set (left). Feature selection (right): comparison of per-class recall on the TCGA test set for the mLRs trained on the complete gene set, on the four filter-based spaces, and on the limma50 and limma50_BWE gene signatures. The Normal-like class is excluded from the graph because the TCGA test set contains only 5 such samples.
Accuracies reached with several intrinsic subtyping methods.
| Subtyping method | Feature space of interest | TCGA training set (%) | TCGA test set (%) | GSE96058 training set (%) | GSE96058 test set (%) |
|---|---|---|---|---|---|
| PAM50* | PAM50 panel | – | 92 | – | 95 |
| mLR | PAM50 panel | 92 | 89 | 93 | 93 |
| mLR | All profiled genes | 88 | 85 | 88 | 89 |
| mLR | limma50 | 92 | 88 | 90 | 91 |
| mLR | limma50_BWE | 93 | 87 | 90 | 91 |
* PAM50 applied on test sets only, using precomputed AWCA references.
Concordances with published calls (accuracies) or AWCA-based PAM50 calls for the main mLR classifiers.
| Training set | Feature space of interest | Intended for | Accuracy on test set (%) | AWCA-PAM50 concordance on test set (%) | External test set | Accuracy on external test set (%) | AWCA-PAM50 concordance on external set (%) |
|---|---|---|---|---|---|---|---|
| TCGA | PAM50 | RSEM | 89 | 89 | PanCA | 90 | 90 |
| TCGA | limma50 | RSEM | 88 | 87 | PanCA | 88 | 90 |
| TCGA | limma50_BWE | RSEM | 87 | 87 | PanCA | 87 | 91 |
| GSE96058 | PAM50 | FPKM | 93 | 92 | GSE81538 | 92 | 93 |
| GSE96058 | limma50 | FPKM | 91 | 91 | GSE81538 | 89 | 89 |
| GSE96058 | limma50_BWE | FPKM | 91 | 91 | GSE81538 | 89 | 89 |
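The accuracy and concordance figures in the tables above are percentages of samples on which two sets of subtype calls agree. A minimal sketch of that computation, together with per-class recall, follows. The function names and the ten toy labels are assumptions for illustration only and are not data from the study.

```python
from collections import Counter


def concordance(calls_a, calls_b):
    """Percentage of samples receiving the same subtype call in both lists."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call lists must have the same length")
    if not calls_a:
        raise ValueError("call lists must not be empty")
    agree = sum(a == b for a, b in zip(calls_a, calls_b))
    return 100.0 * agree / len(calls_a)


def per_class_recall(reference, predicted):
    """Fraction of each reference class that the predicted calls recover."""
    totals = Counter(reference)
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels (hypothetical): one LumB sample is called LumA by the second method.
reference = ["LumA", "LumA", "LumB", "Basal", "Her2",
             "LumA", "Basal", "LumB", "LumA", "Her2"]
predicted = ["LumA", "LumA", "LumA", "Basal", "Her2",
             "LumA", "Basal", "LumB", "LumA", "Her2"]

overall = concordance(reference, predicted)
recall = per_class_recall(reference, predicted)
```

When the reference list holds published calls, the same percentage reads as an accuracy. When it holds calls from another method, such as AWCA-based PAM50, it reads as a concordance. Per-class recall shows which subtypes account for the disagreements.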