Wei Jiao, Gurnit Atwal, Paz Polak, Rosa Karlic, Edwin Cuppen, Alexandra Danyi, Jeroen de Ridder, Carla van Herpen, Martijn P Lolkema, Neeltje Steeghs, Gad Getz, Quaid Morris, Lincoln D Stein.
Abstract
In cancer, the primary tumour's organ of origin and histopathology are the strongest determinants of its clinical behaviour, but in 3% of cases a patient presents with a metastatic tumour and no obvious primary. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we train a deep learning classifier to predict cancer type based on patterns of somatic passenger mutations detected in whole genome sequencing (WGS) of 2606 tumours representing 24 common cancer types produced by the PCAWG Consortium. Our classifier achieves an accuracy of 91% on held-out tumour samples and 88% and 83% respectively on independent primary and metastatic samples, roughly double the accuracy of trained pathologists when presented with a metastatic tumour without knowledge of the primary. Surprisingly, adding information on driver mutations reduced accuracy. Our results have clinical applicability, underscore how patterns of somatic passenger mutations encode the state of the cell of origin, and can inform future strategies to detect the source of circulating tumour DNA.
Year: 2020 PMID: 32024849 PMCID: PMC7002586 DOI: 10.1038/s41467-019-13825-8
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Distribution of tumour types in the PCAWG training and test data sets.
| Abbreviation | Organ system | Tumour type | Tumour samples |
|---|---|---|---|
| Liver-HCC | Liver | Liver hepatocellular carcinoma | 306 |
| Panc-AdenoCA | Pancreas | Pancreatic adenocarcinoma | 235 |
| Breast-AdenoCA | Breast | Breast adenocarcinoma | 198 |
| Prost-AdenoCA | Prostate gland | Prostate adenocarcinoma | 189 |
| CNS-Medullo | Brain, cranial nerves and spinal cord | Medulloblastoma | 146 |
| Kidney-RCC | Kidney | Renal cell carcinoma (proximal tubules) | 143 |
| Ovary-AdenoCA | Ovary | Ovarian adenocarcinoma | 112 |
| Skin-Melanoma | Skin | Skin-melanoma | 106 |
| Lymph-BNHL | Lymph nodes | Mature B-cell lymphoma | 105 |
| Eso-AdenoCA | Oesophagus | Oesophageal adenocarcinoma | 98 |
| Lymph-CLL | Blood, bone marrow and hematopoietic system | Chronic lymphocytic leukaemia | 95 |
| CNS-PiloAstro | Brain, cranial nerves and spinal cord | Pilocytic astrocytoma | 89 |
| Panc-Endocrine | Pancreas | Pancreatic neuroendocrine tumour | 85 |
| Stomach-AdenoCA | Stomach | Gastric adenocarcinoma | 70 |
| Head-SCC | Gum, floor of mouth and other mouth | Head/neck squamous cell carcinoma | 57 |
| ColoRect-AdenoCA | Large intestine (excluding appendix) | Colorectal adenocarcinoma | 52 |
| Lung-SCC | Lung and bronchus | Lung squamous cell carcinoma | 48 |
| Thy-AdenoCA | Thyroid gland | Thyroid adenocarcinoma | 48 |
| Myeloid-MPN | Blood, bone marrow and hematopoietic system | Myeloproliferative neoplasm | 46 |
| Kidney-ChRCC | Kidney | Renal cell carcinoma (distal tubules) | 45 |
| Bone-Osteosarc | Bones and joints | Sarcoma, bone | 44 |
| CNS-GBM | Brain, cranial nerves and spinal cord | Diffuse glioma | 41 |
| Uterus-AdenoCA | Uterus, nos | Uterine adenocarcinoma | 40 |
| Lung-AdenoCA | Lung and bronchus | Lung adenocarcinoma | 38 |
WGS feature types used in classifiers.
| Feature category | Feature type | Feature count | Description |
|---|---|---|---|
| Mutation distribution | SNV-BIN | 2897 | Number of SNVs per 1-Mbp bin, and per chromosome, normalised against the total number of SNVs per sample |
| Mutation distribution | CNA-BIN | 2826 | Number of CNAs per 1-Mbp bin |
| Mutation distribution | SV-BIN | 2929 | Number of SVs per 1-Mbp bin, and per chromosome, normalised against the total number of SVs per sample |
| Mutation distribution | INDEL-BIN | 2757 | Number of indels per 1-Mbp bin, and per chromosome, normalised against the total number of indels per sample |
| Mutation type | MUT-WGS | 150 | Type of single-nucleotide substitution, double- and triple-nucleotide substitution (plus its adjacent nucleotide neighbours) |
| Driver gene/pathway | GEN | 554 | Presence of an impactful mutation in a suspected driver gene |
| Driver gene/pathway | MOD | 1865 | Presence of an impactful mutation in a gene belonging to a suspected driver pathway |
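The binned mutation-distribution features in the table can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names and the input format (a list of `(chromosome, position)` pairs with 1-based positions) are assumptions made for illustration. It counts SNVs per 1-Mbp bin and per chromosome, then normalises each count against the sample's total number of SNVs.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1 Mbp, the bin width given in the feature table


def snv_bin_features(snvs, bin_size=BIN_SIZE):
    """Fraction of a sample's SNVs that fall in each (chromosome, bin).

    `snvs` is a list of (chromosome, 1-based position) pairs; this input
    format is a hypothetical choice, not taken from the paper.
    """
    counts = Counter((chrom, (pos - 1) // bin_size) for chrom, pos in snvs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def snv_chromosome_features(snvs):
    """Fraction of a sample's SNVs that fall on each chromosome."""
    counts = Counter(chrom for chrom, _ in snvs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {chrom: n / total for chrom, n in counts.items()}


# Toy sample with four SNVs (made-up coordinates).
toy_snvs = [("chr1", 10), ("chr1", 999_999), ("chr1", 1_000_001), ("chr2", 5)]
bin_fractions = snv_bin_features(toy_snvs)
chrom_fractions = snv_chromosome_features(toy_snvs)
```

Normalising by the per-sample total makes the features describe where mutations fall rather than how many there are, so samples with very different mutation burdens remain comparable.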
Fig. 1 Comparison of tumour-type classifiers using single and multiple feature types.
a Radar plots describing the cross-validation-derived accuracy (F1) score of Random Forest classifiers trained on each of 7 individual feature categories, across six representative tumour types. b Summary of Random Forest classifier accuracy (F1) trained on individual feature categories across all 24 tumour types. c Accuracy of classifiers trained on multiple feature categories. RF Best Models corresponds to the cross-validation F1 scores of Random Forest classifiers trained on the three best single-feature categories for all 24 tumour types. DNN Model shows the distribution of F1 scores for held-out samples for a multi-class neural network trained using passenger mutation distribution and type. DNN Model + Drivers shows F1 scores for the neural net when driver genes and pathways are added to the training features. The centre line in the boxplot represents the median of the F1 scores. The lower and upper bounds of the box represent the first and third quartiles. The whiskers extend to the third quartile plus 1.5 × IQR and to the first quartile minus 1.5 × IQR.
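The boxplot conventions described in this caption can be written out as a short calculation. The sketch below is illustrative only: the function name is hypothetical, the scores are made up, and the quartile method (`inclusive` interpolation from Python's `statistics` module) is an assumption, since the caption does not say which quartile definition the plotting software used.

```python
from statistics import quantiles


def boxplot_summary(scores):
    """Median, quartiles and 1.5 x IQR whisker limits for a list of F1 scores."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker_limit": q1 - 1.5 * iqr,
        "upper_whisker_limit": q3 + 1.5 * iqr,
    }


# Made-up F1 scores for five tumour types.
summary = boxplot_summary([0.5, 0.6, 0.7, 0.8, 0.9])
```

Many plotting libraries draw each whisker only as far as the most extreme data point inside these limits, so the drawn whisker can be shorter than the computed limit.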
Fig. 2 Heatmap displaying the accuracy of the merged classifier using a held-out portion of the PCAWG data set for evaluation.
Each row corresponds to the true tumour type; columns correspond to the class predictions emitted by the DNN. Cells are labelled with the percentage of tumours of a particular type that were classified by the DNN as a particular type. The recall and precision of each classifier are shown in the colour bars at the top and left sides of the matrix. All values represent the mean of 10 runs using selected data set partitions. Due to rounding of values, some rows add up to slightly more or less than 100%.
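A heatmap of this kind can be derived from true and predicted labels as sketched below. This is a minimal illustration under assumptions, not the authors' code: the function names are hypothetical and the labels are toy data. Rows are true types, columns are predictions, each cell is a row percentage, and recall and precision are read from the diagonal.

```python
def confusion_counts(true_labels, predicted_labels, classes):
    """Count matrix with one row per true class and one column per prediction."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def row_percentages(matrix):
    """Convert each row to whole-number percentages of that row's total."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([round(100 * n / total) if total else 0 for n in row])
    return result


def recall_and_precision(matrix):
    """Per-class recall (diagonal / row sum) and precision (diagonal / column sum)."""
    size = len(matrix)
    recall, precision = [], []
    for i in range(size):
        row_total = sum(matrix[i])
        col_total = sum(matrix[r][i] for r in range(size))
        recall.append(matrix[i][i] / row_total if row_total else 0.0)
        precision.append(matrix[i][i] / col_total if col_total else 0.0)
    return recall, precision


# Toy labels (made up) using three abbreviations from the tumour-type table.
classes = ["Liver-HCC", "Kidney-RCC", "Lung-SCC"]
truth = ["Liver-HCC"] * 3 + ["Kidney-RCC"] * 2 + ["Lung-SCC"]
predicted = ["Liver-HCC", "Kidney-RCC", "Lung-SCC",
             "Kidney-RCC", "Kidney-RCC", "Lung-SCC"]

counts = confusion_counts(truth, predicted, classes)
percentages = row_percentages(counts)
recall, precision = recall_and_precision(counts)
```

In the toy data the first row rounds to 33, 33 and 33, which sums to 99 and shows the rounding effect the caption mentions.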
Fig. 3 Performance of the DNN on held-out PCAWG data.
a The relationship between training set size and prediction accuracy of the DNN is shown for each tumour type. The blue line represents a regression line fit using LOESS regression, while the grey area represents a 95% confidence interval for the regression function. b Accuracy of the classifier when it is asked to identify the correct tumour type among its top N-ranked predictions. The blue dashed line is the median true-positive rate among all 24 tumour classes. The green and red dashed lines correspond to the true-positive rate for the best- and worst-performing tumour classes.
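The top-N evaluation in panel b can be sketched as follows. The function name and the input format (one ranked list of predicted types per sample, best guess first) are assumptions for illustration, and the labels are toy data. For each tumour class the sketch computes the fraction of its samples whose true type appears among the top N predictions, then summarises across classes with the median, best and worst values.

```python
from collections import defaultdict
from statistics import median


def top_n_rate_by_class(true_labels, ranked_predictions, n):
    """Per-class fraction of samples whose true type is in the top n predictions."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, ranked in zip(true_labels, ranked_predictions):
        totals[truth] += 1
        if truth in ranked[:n]:
            hits[truth] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Toy data (made up): two classes, four samples, predictions ranked best-first.
truth = ["A", "A", "B", "B"]
ranked = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["B", "C", "A"],
]

rates_top1 = top_n_rate_by_class(truth, ranked, 1)
rates_top2 = top_n_rate_by_class(truth, ranked, 2)
median_top2 = median(rates_top2.values())
best_top2 = max(rates_top2.values())
worst_top2 = min(rates_top2.values())
```

Raising N can only keep or increase each class's rate, because the top N predictions always include the top N-1.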
Fig. 4 Prediction accuracy for the DNN against two independent validation data sets.
a Primary tumours. b Metastatic tumours. Each row corresponds to the true tumour type; columns correspond to the class predictions emitted by the DNN. Cells are labelled with the percentage of tumours of a particular type that were classified by the DNN as a particular type. The recall and precision of each classifier are shown in the colour bars at the top and left sides of the matrix. Due to rounding of values, some rows add up to slightly more or less than 100%.
Confusion matrix.
| Is the unknown sample a member of a particular histopathological type? | Predicted yes | Predicted no |
|---|---|---|
| Actually yes | TP | FN |
| Actually no | FP | TN |
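The four cells of this matrix are enough to compute the precision, recall and F1 scores used in the figure captions. The sketch below uses the standard definitions of those metrics; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
def one_vs_rest_counts(true_labels, predicted_labels, tumour_type):
    """Return (TP, FN, FP, TN) for the question 'is the sample of this type?'."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == tumour_type and predicted == tumour_type:
            tp += 1  # actually yes, predicted yes
        elif truth == tumour_type:
            fn += 1  # actually yes, predicted no
        elif predicted == tumour_type:
            fp += 1  # actually no, predicted yes
        else:
            tn += 1  # actually no, predicted no
    return tp, fn, fp, tn


def precision_recall_f1(tp, fn, fp):
    """Standard precision, recall and F1 (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels (made up) using two abbreviations from the tumour-type table.
truth = ["Liver-HCC", "Liver-HCC", "Liver-HCC", "Kidney-RCC", "Kidney-RCC"]
predicted = ["Liver-HCC", "Liver-HCC", "Kidney-RCC", "Liver-HCC", "Kidney-RCC"]

tp, fn, fp, tn = one_vs_rest_counts(truth, predicted, "Liver-HCC")
precision, recall, f1 = precision_recall_f1(tp, fn, fp)
```

TN does not enter any of the three scores, which is why F1 stays informative for rare tumour types where true negatives dominate the sample.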