Kee Pang Soh, Ewa Szczurek, Thomas Sakoparnig, Niko Beerenwinkel.
Abstract
BACKGROUND: Establishing the cancer type and site of origin is important in determining the most appropriate course of treatment for cancer patients. Patients with cancer of unknown primary, where the site of origin cannot be established from an examination of the metastatic cancer cells, typically have poor survival. Here, we evaluate the potential and limitations of utilising gene alteration data from tumour DNA to identify cancer types.
Keywords: Cancer diagnostics; Cancer genomics; Cancer-type prediction; Machine learning; Personalised medicine
Year: 2017 PMID: 29183400 PMCID: PMC5706302 DOI: 10.1186/s13073-017-0493-2
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Cancer types and their respective sample sizes
| Cancer type (data set) | Class label | Sample size |
|---|---|---|
| Bladder urothelial carcinoma (TCGA, Nature 2014) | 1 | 127 |
| Breast invasive carcinoma (TCGA, Cell 2015) | 2 | 973 |
| Colorectal adenocarcinoma (TCGA, Nature 2012) | 3 | 212 |
| Glioblastoma (TCGA, Cell 2013) | 4 | 280 |
| Head and neck squamous cell carcinoma (TCGA, Nature 2015) | 5 | 279 |
| Kidney renal clear cell carcinoma (TCGA, Nature 2013) | 6 | 418 |
| Acute myeloid leukaemia (TCGA, NEJM 2013) | 7 | 190 |
| Lung adenocarcinoma (TCGA, Nature 2014) | 8 | 230 |
| Lung squamous cell carcinoma (TCGA, Nature 2012) | 9 | 178 |
| Ovarian serous cystadenocarcinoma (TCGA, Nature 2011) | 10 | 316 |
| Uterine corpus endometrial carcinoma (TCGA, Nature 2013) | 11 | 240 |
| Adenoid cystic carcinoma (MSKCC, Nat Genet 2013) | 12 | 55 |
| Brain lower grade glioma (TCGA, Provisional) | 13 | 279 |
| Cervical squamous cell carcinoma and endocervical adenocarcinoma (TCGA, Provisional) | 14 | 191 |
| Kidney renal papillary cell carcinoma (TCGA, Provisional) | 15 | 161 |
| Liver hepatocellular carcinoma (AMC, Hepatology 2014) | 16 | 231 |
| Pancreatic adenocarcinoma (TCGA, Provisional) | 17 | 145 |
| Prostate adenocarcinoma (TCGA, Cell 2015) | 18 | 332 |
| Skin cutaneous melanoma (TCGA, Provisional) | 19 | 278 |
| Stomach adenocarcinoma (TCGA, Nature 2014) | 20 | 287 |
| Papillary thyroid carcinoma (TCGA, Cell 2014) | 21 | 399 |
| Adrenocortical carcinoma (TCGA, Provisional) | 22 | 88 |
| Kidney chromophobe (TCGA, Cancer Cell 2014) | 23 | 65 |
| Pheochromocytoma and paraganglioma (TCGA, Provisional) | 24 | 161 |
| Sarcoma (TCGA, Provisional) | 25 | 240 |
| Testicular germ cell cancer (TCGA, Provisional) | 26 | 149 |
| Uterine carcinosarcoma (TCGA, Provisional) | 27 | 56 |
| Uveal melanoma (TCGA, Provisional) | 28 | 80 |
The data were downloaded via the cBioPortal for Cancer Genomics
Fig. 1 Performance of different classifiers. Using (a) only somatic point-mutated genes, (b) only copy number altered genes and (c) both somatic point-mutated genes and copy number altered genes as the predictors. The mean overall accuracy, with its 95% confidence interval band, was computed using the results from 50 sets of randomly subsampled training data and their corresponding test data. For SVM-RFE and random forest, we first ranked the genes in decreasing order of their importance, before using an increasing number of them to train and test the classifiers. For L1-logistic regression, we varied the parameter λ to control the number of genes selected. The accuracy of a random classifier is also plotted to provide a baseline for comparison. The random classifier assigns a tumour sample to the different cancer classes with probabilities proportional to the size of those classes in the training data set.
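The random baseline described in this caption has a simple expected accuracy: if a sample is assigned to class k with probability p_k equal to that class's share of the training data, and the test data share the same class proportions, the expected accuracy is the sum of p_k squared over all classes. The following is a minimal sketch of that calculation using the sample sizes from the table of cancer types. Using the full-data-set proportions in place of the subsampled training proportions is an assumption made here for illustration, and the function name `expected_random_accuracy` is hypothetical.

```python
# Sample sizes for class labels 1..28, copied from the table of cancer types.
SAMPLE_SIZES = [
    127, 973, 212, 280, 279, 418, 190, 230, 178, 316,
    240, 55, 279, 191, 161, 231, 145, 332, 278, 287,
    399, 88, 65, 161, 240, 149, 56, 80,
]


def expected_random_accuracy(class_sizes):
    """Expected accuracy of a classifier that guesses class k with
    probability proportional to its size.

    Assumption: the test data have the same class proportions as the
    data used to set the guessing probabilities.
    """
    total = sum(class_sizes)
    # A guess is correct when the true class and the guessed class coincide,
    # which happens with probability p_k * p_k for each class k.
    return sum((n / total) ** 2 for n in class_sizes)


baseline = expected_random_accuracy(SAMPLE_SIZES)
print(f"{sum(SAMPLE_SIZES)} samples, expected baseline accuracy {baseline:.3f}")
```

Under these assumptions the baseline comes out at roughly 5%, well below uniform guessing over two classes but above the 1/28 that uniform guessing over 28 classes would give, because the large breast carcinoma class dominates the sum.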
Fig. 2 Precision and recall of each of the 28 cancer types for the best SVM model. Here, the 900 top-ranked genes, comprising both somatic point-mutated genes and copy number altered genes, were used to train the SVM. SVM support vector machine
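Per-class precision and recall, as plotted in this figure, can be computed directly from the true and predicted class labels. The sketch below uses only the standard definitions (precision = correct predictions of a class divided by all predictions of that class; recall = correct predictions divided by all true members of that class). The function name and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    actual = Counter(y_true)      # number of true members of each class
    predicted = Counter(y_pred)   # number of times each class was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        # Report 0.0 when a class is never predicted or never occurs.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy example with three classes (hypothetical labels).
y_true = [1, 1, 2, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3]
scores = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"class {label}: precision={precision:.2f}, recall={recall:.2f}")
```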
Fig. 3 Performance of the top predictor sets when both somatic point-mutated genes and copy number altered genes were used as predictors. The genes were ranked using SVM-RFE. For each top gene set of size n, we considered the (n+1)th to (2n)th genes as the second best predictor set, and the (2n+1)th to (3n)th genes as the third best predictor set. We then varied n and computed the accuracy of SVM for these three gene sets. SVM support vector machine
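The construction of the three predictor sets in this caption amounts to slicing a ranked gene list into consecutive blocks of size n. A minimal sketch follows, assuming the ranking is already available as a list ordered from most to least important. The function name and the placeholder gene names are hypothetical.

```python
def ranked_predictor_sets(ranked_genes, n):
    """Split a ranked gene list into the best, second best and third best
    predictor sets, each of size n.

    ranked_genes must be ordered from most to least important.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(ranked_genes) < 3 * n:
        raise ValueError("need at least 3n ranked genes")

    best = ranked_genes[:n]             # genes 1 .. n
    second = ranked_genes[n:2 * n]      # genes n+1 .. 2n
    third = ranked_genes[2 * n:3 * n]   # genes 2n+1 .. 3n
    return best, second, third


# Placeholder ranking (hypothetical names, not genes from the study).
ranked = [f"gene_{i}" for i in range(1, 10)]
best, second, third = ranked_predictor_sets(ranked, 3)
print(best, second, third)
```

Each of the three sets would then be used in turn to train and test the classifier, repeating the split for every value of n of interest.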
Overall accuracy of SVM for small gene sets selected by RFE
| Number of genes used | Only somatic point-mutated genes as predictors | Only copy number altered genes as predictors | Somatic point-mutated genes and copy number altered genes as predictors |
|---|---|---|---|
| 10 | 28.8±0.5 | 39.3±0.8 | 40.6±0.9 |
| 20 | 35.3±0.5 | 53.4±0.4 | 61.5±0.6 |
| 50 | 44.3±0.4 | 67.7±0.4 | 77.7±0.3 |
| 70 | 47.2±0.4 | 71.7±0.3 | 81.2±0.3 |
| 100 | 49.4±0.4 | 74.7±0.3 | 83.8±0.3 |
Fig. 4 Precision and recall of each of the 28 cancer types, for the SVM model trained with 50 genes chosen via stability selection. The SVM was tested on the 1661 unseen tumour samples that we set aside at the beginning for validation. SVM support vector machine
Overall accuracy of the SVM classifier trained using the genes proposed by Martinez et al. and the genes selected via SVM-RFE and stability selection in this study
| Classification task | 25-gene panel in Martinez et al. | Top 25 SVM-RFE-based SPM genes | Top 25 SVM-RFE-based SPM and CNA genes |
|---|---|---|---|
| 28 cancer types of this study | 30.4 | 39.0 | 67.7 |
| 10 cancer types of Martinez et al. | 54.6 | 57.4 | 85.4 |
The classifier was tested on 1661 unseen tumour samples
CNA copy number altered, SPM somatic point-mutated, SVM support vector machine, SVM-RFE SVM recursive feature elimination
Overall accuracy of the SVM classifier trained using the gene panel proposed by OncoPaD and the genes selected in this study via SVM-RFE and stability selection
| Classification task | 277 OncoPaD tier 1 genes | Top 277 SVM-RFE-based SPM genes | Top 277 SVM-RFE-based SPM and CNA genes |
|---|---|---|---|
| 28 cancer types of this study | 49.6 | 57.3 | 88.1 |
| 19 cancer types common between this study and OncoPaD | 56.0 | 63.4 | 90.3 |
The 19 tumour types that are common to our data set and OncoPaD are those labelled 1–11, 13, 14, and 16–21 in Table 1
CNA copy number altered, SPM somatic point-mutated, SVM support vector machine, SVM-RFE SVM recursive feature elimination