Yue Zhao, Ziwei Pan, Sandeep Namburi, Andrew Pattison, Atara Posner, Shiva Balachander, Carolyn A Paisie, Honey V Reddi, Jens Rueter, Anthony J Gill, Stephen Fox, Kanwal P S Raghav, William F Flynn, Richard W Tothill, Sheng Li, R Krishna Murthy Karuturi, Joshy George.
Abstract
BACKGROUND: Cancer of unknown primary (CUP), representing approximately 3-5% of all malignancies, is defined as metastatic cancer in which a primary site of origin cannot be found despite a standard diagnostic workup. Because knowledge of a patient's primary cancer remains fundamental to their treatment, CUP patients are significantly disadvantaged and most have poor survival outcomes. Developing robust and accessible diagnostic methods for resolving the cancer tissue of origin therefore has significant value for CUP patients.
Keywords: Cancer; Cancer-of-unknown-primary; Cell-of-origin; Classification; Convolutional neural network; Deep learning; Inception model; Machine learning; TCGA
Year: 2020 PMID: 33039710 PMCID: PMC7553237 DOI: 10.1016/j.ebiom.2020.103030
Source DB: PubMed Journal: EBioMedicine ISSN: 2352-3964 Impact factor: 8.143
Performance of previously published CUP classification methods.
| Year | Input type | Feature number | Training/cross-validation accuracy (N) | Training tumour types | External validation accuracy (N) | Validation tumour | Validation tumour types+ |
|---|---|---|---|---|---|---|---|
| 2011 | RT-PCR assay | 92 | 87%/85% (2,206) | 30 | 83% (187) | P + M | 28 |
| 2012 | RT-PCR assay | 92 | - | - | 87% P / 82% M (790 PM) | P + M | 28 |
| 2012 | RT-PCR assay | 92 | - | - | 82.1% (184) | P + M | 23 |
| 2011 | Microarray | 2,000 | (2,136) | 15 | 88.5% (462) | P + M | 15 |
| 2015 | Microarray | 2,000 | - | - | 89% (157) | P + M | 15 |
| 2012 | microRNA array | 64 | 87% (1,282) | 42 | 85% (509) | P + M | 42 |
| 2015 | Microarray | 29,285 | 82% (450) | 18 | 88% (94) | P + M | 18 |
| 2016 | DNA methylation microarray | 485,577 | (2,790) | 38 | 94% (534) | M | 21 |
| 2019 | RNA-seq | 17,688 | 97% (10,822) | 40 T | 86% (201) | M | 40 |
| 2019 | Targeted DNA sequencing | 341 | 73.8% (7,791) | 22 | 74.1% (11,644) | P + M | 22 |
| 2020 | WGS | - | 91% (2,206) | 24 | 88% P / 83% M (2,120) | P + M | 16 |
v2 = version 2 of the CancerTypeID GEP test
+ = validation series of commercial tests
T = tumour
AN = adjacent normal
P = primary tumours
M = metastatic tumours
Fig. 1. Prediction workflow for primary tumour types and subtypes. (a) Schematic of the learning procedure used to train the 1D-Inception model on labelled TCGA and ICGC transcriptomes spanning 32 cancer types for primary tumour type prediction. Models were trained on 70% of the data and validated on the remaining 30%, using normalized and standard-scaled expression profiles. 817 features were selected (see Materials and methods). Primary tumour type classification performance was evaluated by cross-validation on the learning set of TCGA and ICGC primary tumour samples and by external validation using the primary tumour types of transcriptomes from metastatic and clinical samples. (b) Illustration of the 1D-Inception architecture optimized by Talos [47] scanning on the TCGA and ICGC dataset. Each rectangle represents a layer in the neural network. For convolutional layers, the kernel size is shown, and layers with the same kernel size share a colour. Max pooling layers are green rectangles with the pooling window size inside. Dark grey rectangles are dropout layers with the keep probability shown. The concatenation layer has 1696 hidden nodes, a size determined by the output sizes of the convolutional layers. The bottom portion shows two fully connected layers with 128 nodes each, with the output layer below them. (c) Schematic of the learning procedure used to train random forest (RF) models with 11 molecular subtypes for cancer subtype prediction. Models were trained and evaluated using 10-fold cross-validation on normalized and standard-scaled expression profiles. N features were selected from each class (see Methods) and pooled for each fold to construct 11 molecular subtype predictors for random forest (RF). Cancer subtype classification performance was evaluated by cross-validation on the learning set and by external validation on breast and ovarian cancer datasets.
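The 70%/30% split and standard scaling described in panel (a) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the stratification by class, the choice to fit the scaler on the training portion only, and the names `stratified_split`, `fit_scaler` and `standard_scale` are all assumptions, and the toy data are random numbers, not expression profiles.

```python
import random
from statistics import mean, pstdev


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices so each class contributes ~train_fraction to training."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


def fit_scaler(rows):
    """Per-feature mean and population standard deviation (a zero std becomes 1)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standard_scale(rows, means, stds):
    """Apply (value - mean) / std feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data (assumption): 20 samples, 3 features, two hypothetical classes.
rng = random.Random(1)
labels = ["A"] * 10 + ["B"] * 10
profiles = [[rng.gauss(5.0, 2.0) for _ in range(3)] for _ in labels]

train_idx, test_idx = stratified_split(labels)
# Assumption: scaling statistics come from the training portion only,
# so no information from the held-out samples leaks into training.
means, stds = fit_scaler([profiles[i] for i in train_idx])
train_scaled = standard_scale([profiles[i] for i in train_idx], means, stds)
test_scaled = standard_scale([profiles[i] for i in test_idx], means, stds)
```

Splitting per class keeps small cohorts represented in both portions, which matters when class sizes differ widely.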
32 Cancer cohorts for primary classification from TCGA and ICGC.
| Cohort Abbreviation | Cases | Disease Name |
|---|---|---|
| ACC | 79 | Adrenocortical carcinoma |
| BLCA | 726 | Bladder urothelial carcinoma |
| BRCA | 2,320 | Breast invasive carcinoma |
| CESC | 568 | Cervical and endocervical cancers |
| CHOL | 36 | Cholangiocarcinoma |
| COADREAD | 873 | Colon adenocarcinoma & Rectum adenocarcinoma |
| DLBC | 48 | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma |
| ESCA | 184 | Esophageal carcinoma |
| GBM | 319 | Glioblastoma multiforme |
| HNSC | 1,044 | Head and Neck squamous cell carcinoma |
| KICH | 66 | Kidney Chromophobe |
| KIRC | 1,131 | Kidney renal clear cell carcinoma |
| KIRP | 544 | Kidney renal papillary cell carcinoma |
| LAML | 346 | Acute Myeloid Leukemia |
| LGG | 969 | Brain Lower Grade Glioma |
| LIHC | 716 | Liver hepatocellular carcinoma |
| LUAD | 1,058 | Lung adenocarcinoma |
| LUSC | 974 | Lung squamous cell carcinoma |
| MESO | 87 | Mesothelioma |
| OV | 679 | Ovarian serous cystadenocarcinoma |
| PAAD | 323 | Pancreatic adenocarcinoma |
| PCPG | 179 | Pheochromocytoma and Paraganglioma |
| PRAD | 1,097 | Prostate adenocarcinoma |
| SARC | 259 | Sarcoma |
| SKCM | 537 | Skin Cutaneous Melanoma |
| STAD | 865 | Stomach adenocarcinoma |
| TGCT | 150 | Testicular Germ Cell Tumors |
| THCA | 1,067 | Thyroid carcinoma |
| THYM | 120 | Thymoma |
| UCEC | 716 | Uterine Corpus Endometrial Carcinoma |
| UCS | 57 | Uterine Carcinosarcoma |
| UVM | 80 | Uveal Melanoma |
JAX clinical dataset for external validation of primary tumour type predictor.
| Cohort Abbreviation | Cases | Tumour Name |
|---|---|---|
| BRCA | 6 | Breast invasive carcinoma |
| COADREAD | 5 | Colon adenocarcinoma & Rectum adenocarcinoma |
| LUAD | 3 | Lung adenocarcinoma |
| LUSC | 3 | Lung squamous cell carcinoma |
| PRAD | 5 | Prostate adenocarcinoma |
| THCA | 1 | Thyroid carcinoma |
Melbourne dataset for external validation of primary tumour type predictor.
| Cohort Abbreviation | Cases | Tumour Name |
|---|---|---|
| BLCA | 4 | Bladder urothelial carcinoma |
| BRCA | 4 | Breast invasive carcinoma |
| CHOL | 5 | Cholangiocarcinoma |
| COADREAD | 5 | Colon adenocarcinoma & Rectum adenocarcinoma |
| HNSC | 1 | Head and Neck squamous cell carcinoma |
| KIRC | 4 | Kidney renal clear cell carcinoma |
| LIHC | 2 | Liver hepatocellular carcinoma |
| LUAD | 5 | Lung adenocarcinoma |
| LUSC | 3 | Lung squamous cell carcinoma |
| MESO | 3 | Mesothelioma |
| OV | 3 | Ovarian serous cystadenocarcinoma |
| PAAD | 5 | Pancreatic adenocarcinoma |
| PRAD | 5 | Prostate adenocarcinoma |
| SARC | 4 | Sarcoma |
| SKCM | 5 | Skin Cutaneous Melanoma |
| STAD | 3 | Stomach adenocarcinoma |
| TGCT | 4 | Testicular Germ Cell Tumors |
| THCA | 4 | Thyroid carcinoma |
Fig. 2. Primary tumour type prediction performance of CNN models on the TCGA dataset. (a) Cross-entropy loss of the CNN models on validation data. The training of all three models converged. (b) Overall prediction accuracy of the CNN models in cross-validation and external metastasis validation. (c) Per-class accuracy of the CNN models.
Fig. 3. Cross- and external validation of the primary tumour type predictor. The 1D-Inception model was constructed for primary tumour type prediction. The 32 primary tumour types are grouped by pan-organ system. (a) Inception model confusion matrix for cross-validation of 32 primary tumour types on the TCGA and ICGC dataset. Accuracy for each prediction class is shown to the right of the table. (b) 394 expression profiles of TCGA metastatic tumours, with primary sites of origin spanning 11 organs, were classified by the primary tumour type predictor. (c) 23 expression profiles from clinical datasets spanning 6 cancer types were classified by the primary tumour type predictor. (d) 69 expression profiles from the Melbourne dataset spanning 18 cancer types were classified by the primary tumour type predictor. The text in contingency table cell c_ij of (b) and (c) shows the number of class i tumour samples classified as class j. The confusion matrix heatmap is coloured in grayscale. Colour shading along the main diagonal shows pan-organ groups.
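As a rough illustration of how such a confusion matrix and its per-class accuracy column can be computed, the sketch below counts class i samples classified as class j and divides each diagonal entry by its row total. The function names and the toy labels are assumptions made for the example; the predictions are invented and are not results from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """counts[i][j] = number of class i samples classified as class j."""
    index = {name: k for k, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(true_labels, predicted_labels):
        counts[index[truth]][index[predicted]] += 1
    return counts


def per_class_accuracy(counts):
    """Fraction of each true class that was classified correctly (row-wise)."""
    result = []
    for i, row in enumerate(counts):
        total = sum(row)
        result.append(row[i] / total if total else float("nan"))
    return result


# Invented toy labels (assumption), using three cohort abbreviations as class names.
classes = ["LUAD", "LUSC", "STAD"]
truth = ["LUAD", "LUAD", "LUAD", "LUSC", "LUSC", "STAD", "STAD", "STAD"]
predicted = ["LUAD", "LUAD", "LUSC", "LUSC", "LUSC", "STAD", "STAD", "LUAD"]

counts = confusion_matrix(truth, predicted, classes)
accuracy = per_class_accuracy(counts)
overall = sum(counts[i][i] for i in range(len(classes))) / len(truth)
```

Off-diagonal entries show which classes are confused with one another, which is the information the heatmap makes visible.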
Fig. 4. Unsupervised embedding of expression profiles reveals relationships among primary sites. Expression profiles from all samples in the TCGA dataset were embedded into two dimensions using uniform manifold approximation and projection (UMAP) [86] and coloured by primary tumour type. For each cancer, the label is placed near the centroid of its expression profiles in the UMAP latent space. Anatomical and histological relationships emerge and add context to the most common misclassifications in Figure S2a. The following groups of cancers are highlighted with green, blue, and purple ellipses, respectively: i) COADREAD, STAD; ii) BLCA, CESC, ESCA, HNSC, LUSC; iii) GBM, LGG.
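The label placement described here amounts to averaging the two-dimensional coordinates of each cancer type. A minimal sketch, assuming the embedding has already been computed by a UMAP implementation (not run here); the coordinates below are invented and `class_centroids` is a hypothetical helper name.

```python
from collections import defaultdict


def class_centroids(points, labels):
    """Mean (x, y) position of the embedded samples of each class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(points, labels):
        entry = sums[label]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


# Invented 2D coordinates (assumption) standing in for a UMAP embedding.
embedding = [(0.0, 0.0), (2.0, 0.0), (10.0, 4.0), (12.0, 6.0)]
labels = ["GBM", "GBM", "LGG", "LGG"]

centroids = class_centroids(embedding, labels)
```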
Fig. 5. Cross- and external validation of molecular subtype predictors. A predictor of molecular subtypes was constructed for each of 11 primary tumour types, spanning 38 molecular subtypes on the TCGA dataset. (a) Per-class accuracy, (b) specificity, and (c) sensitivity of molecular subtype classifications evaluated through cross-validation (Fig. 1c). To further validate these subtype predictors, the ovarian (d) and breast (e) subtype predictors were used to predict the respective molecular subtypes in two external datasets (GSE9899 and EGAS00000000083, respectively).
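The three per-class metrics in panels (a) to (c) can be derived from a confusion matrix by treating each subtype as "this class versus the rest". The sketch below assumes the standard one-vs-rest definitions, which the caption does not spell out; the matrix is invented and `one_vs_rest_metrics` is a hypothetical helper name.

```python
def one_vs_rest_metrics(counts, k):
    """Sensitivity, specificity and accuracy for class k against all other classes.

    counts[i][j] is the number of class i samples classified as class j.
    """
    total = sum(sum(row) for row in counts)
    tp = counts[k][k]
    fn = sum(counts[k]) - tp                 # class k samples predicted as another class
    fp = sum(row[k] for row in counts) - tp  # other samples predicted as class k
    tn = total - tp - fn - fp
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


# Invented counts (assumption) for three hypothetical subtypes.
counts = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
metrics = one_vs_rest_metrics(counts, 0)
```

Under these definitions a subtype can show high specificity and accuracy while its sensitivity is low, which is why the three panels are reported separately.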