Andrea Marion Marquard, Nicolai Juul Birkbak, Cecilia Engel Thomas, Francesco Favero, Marcin Krzystanek, Celine Lefebvre, Charles Ferté, Mariam Jamal-Hanjani, Gareth A Wilson, Seema Shafi, Charles Swanton, Fabrice André, Zoltan Szallasi, Aron Charles Eklund.
Abstract
BACKGROUND: A substantial proportion of cancer cases present with a metastatic tumor and require further testing to determine the primary site; many of these are never fully diagnosed and remain cancer of unknown primary origin (CUP). It has been previously demonstrated that the somatic point mutations detected in a tumor can be used to identify its site of origin with limited accuracy. We hypothesized that higher accuracy could be achieved by a classification algorithm based on the following feature sets: 1) the number of nonsynonymous point mutations in a set of 232 specific cancer-associated genes, 2) frequencies of the 96 classes of single-nucleotide substitution determined by the flanking bases, and 3) copy number profiles, if available.
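The second feature set can be illustrated with a minimal sketch. It assumes the common convention of writing each substitution from the pyrimidine strand (6 substitution types × 4 upstream bases × 4 downstream bases = 96 classes); the abstract does not state which convention was used. The function names and the `A[C>T]G` label format are hypothetical choices for illustration only.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Six substitution types, written from the pyrimidine (C or T) strand.
# Assumption: this strand convention is not stated in the source.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"),
                 ("T", "A"), ("T", "C"), ("T", "G")]

# All 96 class labels, e.g. "A[C>T]G" = 5' base, substitution, 3' base.
CLASSES = [f"{five}[{ref}>{alt}]{three}"
           for (ref, alt), five, three in product(SUBSTITUTIONS, BASES, BASES)]


def substitution_class(context, alt):
    """Map a trinucleotide context (5' base, reference base, 3' base)
    and the alternate base to one of the 96 class labels."""
    five, ref, three = context
    if ref in "AG":
        # Purine reference: describe the same event on the opposite strand.
        five, ref, three = COMPLEMENT[three], COMPLEMENT[ref], COMPLEMENT[five]
        alt = COMPLEMENT[alt]
    return f"{five}[{ref}>{alt}]{three}"


def substitution_frequencies(mutations):
    """Return the relative frequency of each of the 96 classes.

    `mutations` is an iterable of (context, alt) pairs for one tumor.
    """
    counts = Counter(substitution_class(ctx, alt) for ctx, alt in mutations)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


# Hypothetical toy input: three point mutations from one tumor.
example = [("ACG", "T"), ("CGT", "A"), ("TTA", "G")]
freqs = substitution_frequencies(example)
```

In this toy input the first two mutations are the same event seen from opposite strands, so both fall into the `A[C>T]G` class.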
Year: 2015 PMID: 26429708 PMCID: PMC4590711 DOI: 10.1186/s12920-015-0130-0
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Number of specimens available in the COSMIC whole genomes v68 database, with point mutations (PM) or with both point mutations and copy number aberrations (PM + CN), including those in the training set and those in the testing set. Categories with counts <200 were not analyzed and are omitted here
| Primary site | PM | PM + CN |
|---|---|---|
| Breast | 936 | 850 |
| Endometrium | 281 | 246 |
| Kidney | 468 | 300 |
| Large intestine | 592 | 486 |
| Liver | 415 | |
| Lung | 807 | 476 |
| Ovary | 497 | 462 |
| Pancreas | 311 | |
| Prostate | 372 | |
| Skin | 296 | |
| Total | 4975 | 2820 |
Fig. 1 Classifier outline. Somatic point mutation data is used to determine the mutation status of a set of cancer genes and to calculate the distributions of 96 classes of base substitutions. When copy number profiles are available, they are used to infer any SCNAs in the same set of cancer genes. These features are combined and provided to a set of random forest classifiers, one per primary site, each of which generates a classification score. The PM classifier does not use copy number profiles and is trained to distinguish between all 10 primary sites. The PM + CN classifier does use copy number profiles (orange), but can only distinguish between 6 primary sites (white) due to less training data. Thus, blue boxes are components of the PM classifier only, orange boxes are components of the PM + CN classifier only, and white boxes are components of both classifiers. These sites were selected based on the availability of sufficient training data (>200 cases)
Fig. 2 Cross-validation accuracy in the training data using various combinations of feature sets. Random forest ensembles were trained using the feature sets shown in the tables below each bar, and classification accuracy was evaluated by cross-validation. Sufficient SCNA data was available for only six of ten primary sites; thus we analyzed these six sites separately when including SCNAs. a Classification accuracy when excluding SCNAs and distinguishing between ten primary sites. b Classification accuracy when including SCNAs and distinguishing between six primary sites. Accuracies of individual sites are indicated by colored circles. The two combinations of feature sets selected for further analysis are indicated at the top; PM: point mutations only, PM + CN: point mutations and copy number aberrations
Fig. 3 Performance of final PM classifier on the test data. a Confusion matrix of actual vs. predicted primary sites, with sensitivity, specificity, and marginal frequencies. b Performance of the final classifier in prioritizing primary sites. Each point indicates the cumulative accuracy when, for each sample, the top n highest-scoring sites are considered, or when sites are ranked by frequency or by random guess. c Classification accuracy increases with confidence score. Circles and bars indicate the accuracy and 95 % confidence interval for each bin of samples. Grey columns indicate the number of samples in each bin. d Accuracy vs. fraction of samples called. Accuracy (solid line) and 95 % confidence interval (grey region) of the corresponding fraction of tumors with highest confidence score. The fraction of tumors for which an accuracy of 95 % can be achieved is shown by a red circle with whiskers at the bottom
Fig. 4 Performance of final PM + CN classifier on the test data. a–d see Fig. 3 legend
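The quantities plotted in panels b and d of these figures can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the per-site scores and site names are made-up toy values, and the confidence value is taken as a given input because the legends do not define how it is derived.

```python
def top_n_accuracy(scores, truths, n):
    """Fraction of samples whose true site is among the n highest-scoring sites.

    `scores` is a list of {site: score} dicts, one per sample;
    `truths` is the list of true primary sites in the same order.
    """
    hits = 0
    for sample_scores, truth in zip(scores, truths):
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        hits += truth in ranked[:n]
    return hits / len(truths)


def accuracy_vs_fraction_called(confidences, correct):
    """Accuracy among the k most confident samples, for every k.

    Returns a list of (fraction of samples called, accuracy) pairs,
    ordered from the most confident sample downwards.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        curve.append((k / len(order), hits / k))
    return curve


# Hypothetical toy data: three samples, three candidate sites.
scores = [
    {"Breast": 0.70, "Lung": 0.20, "Ovary": 0.10},
    {"Breast": 0.40, "Lung": 0.35, "Ovary": 0.25},
    {"Breast": 0.10, "Lung": 0.30, "Ovary": 0.60},
]
truths = ["Breast", "Lung", "Breast"]

cumulative = [top_n_accuracy(scores, truths, n) for n in (1, 2, 3)]
curve = accuracy_vs_fraction_called([0.9, 0.5, 0.7], [True, False, True])
```

Top-n accuracy can only grow as n increases and reaches 1 when n equals the number of candidate sites, which matches the cumulative shape described for panel b.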
Some clinical subgroups are associated with increased or decreased performance of the primary site classifiers PM and PM + CN
| Category | Primary site | Subgroup | PM Acc. (%) | PM N | PM P | PM + CN Acc. (%) | PM + CN N | PM + CN P |
|---|---|---|---|---|---|---|---|---|
| Subtype | Breast | ER | 64 | 417 | 0.064 | 91 | 416 | 0.00033 ** |
| Subtype | Breast | HER2 | 63 | 146 | 0.31 | 91 | 138 | 0.037 * |
| Subtype | Breast | TNBC | 27 | 98 | 4.1 × 10⁻⁹ ** | 40 | 97 | 3.3 × 10⁻¹⁸ ** |
| Subtype | Endometrium | MSI | 77 | 71 | 0.015 * | 93 | 70 | 3 × 10⁻⁵ ** |
| Subtype | Endometrium | MSS | 54 | 157 | 0.17 | 59 | 156 | 0.038 * |
| Subtype | Large intestine | MSI | 97 | 68 | 0.091 | 74 | 68 | 8.6 × 10⁻⁵ ** |
| Subtype | Large intestine | MSS | 88 | 233 | 0.48 | 97 | 230 | 0.0075 ** |
| Subtype | Ovary | mBRCA1 | 76 | 55 | 0.097 | 96 | 55 | 0.56 |
| Subtype | Ovary | mBRCA2 | 79 | 39 | 0.077 | 97 | 38 | 0.5 |
| Subtype | Ovary | wtBRCA | 61 | 338 | 0.29 | 92 | 333 | 0.58 |
| Stage | Breast | Stage I | 65 | 129 | 0.24 | 82 | 127 | 0.6 |
| Stage | Breast | Stage II | 59 | 437 | 0.95 | 84 | 432 | 0.93 |
| Stage | Breast | Stage III | 57 | 175 | 0.55 | 84 | 172 | 1 |
| Stage | Breast | Stage IV | 47 | 15 | 0.43 | 87 | 15 | 1 |
| Stage | Kidney | Stage I | 80 | 153 | 0.8 | 95 | 149 | 1 |
| Stage | Kidney | Stage II | 81 | 32 | 1 | 91 | 32 | 0.44 |
| Stage | Kidney | Stage III | 81 | 78 | 0.87 | 97 | 77 | 0.39 |
| Stage | Kidney | Stage IV | 88 | 43 | 0.39 | 88 | 42 | 0.18 |
| Stage | Large intestine | Stage I | 89 | 65 | 0.82 | 94 | 64 | 1 |
| Stage | Large intestine | Stage II | 90 | 143 | 0.87 | 91 | 141 | 0.45 |
| Stage | Large intestine | Stage III | 89 | 101 | 0.85 | 93 | 101 | 1 |
| Stage | Large intestine | Stage IV | 94 | 49 | 0.6 | 98 | 49 | 0.35 |
| Stage | Lung | Stage I | 79 | 261 | 0.7 | 82 | 257 | 0.53 |
| Stage | Lung | Stage II | 78 | 106 | 0.69 | 84 | 105 | 0.88 |
| Stage | Lung | Stage III | 87 | 97 | 0.16 | 89 | 95 | 0.27 |
| Stage | Lung | Stage IV | 74 | 19 | 0.56 | 89 | 18 | 1 |
| Grade | Endometrium | G1 | 74 | 76 | 0.055 | 88 | 76 | 0.0022 ** |
| Grade | Endometrium | G2 | 73 | 75 | 0.073 | 86 | 73 | 0.0088 ** |
| Grade | Endometrium | G3 | 41 | 92 | 0.0013 ** | 45 | 92 | 1.2 × 10⁻⁵ ** |
| Grade | Kidney | G1 | 71 | 7 | 0.61 | 100 | 7 | 1 |
| Grade | Kidney | G2 | 84 | 128 | 0.68 | 93 | 125 | 0.66 |
| Grade | Kidney | G3 | 80 | 122 | 0.68 | 96 | 120 | 0.63 |
| Grade | Kidney | G4 | 82 | 45 | 1 | 93 | 44 | 0.73 |
| Grade | Ovary | G1 | 0 | 3 | 0.056 | 33 | 3 | 0.014 * |
| Grade | Ovary | G2 | 60 | 55 | 0.77 | 87 | 54 | 0.098 |
| Grade | Ovary | G3 | 63 | 405 | 0.83 | 95 | 394 | 0.47 |
| Grade | Ovary | G4 | 0 | 1 | 0.38 | 100 | 1 | 1 |
Information on subtype, grade and stage was retrieved from TCGA and is therefore not available for all tumors in the COSMIC database. ER: estrogen receptor positive. HER2: human epidermal growth factor receptor 2 positive. TNBC: triple negative breast cancer. MSI: microsatellite instability. MSS: microsatellite stable. mBRCA1: mutated BRCA1. mBRCA2: mutated BRCA2. wtBRCA: wildtype BRCA1 and BRCA2. Acc.: accuracy, i.e., the percentage of tumors correctly classified. N: the number of tumors in the subgroup. P: p-value from Fisher’s exact test comparing accuracy among samples in or not in each subgroup. *p < 0.05. **p < 0.01
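The subgroup comparison described in the footnote can be sketched with a small standard-library implementation of Fisher's exact test on a 2 × 2 table (in subgroup vs. not in subgroup, correctly vs. incorrectly classified). The two-sided form used here is an assumption, since the footnote does not say which alternative was tested, and the counts in the usage line are invented for illustration and are not taken from the table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: in subgroup / not in subgroup.
    Columns: correctly classified / incorrectly classified.
    The p-value sums the hypergeometric probabilities of all tables
    (with the same margins) that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Hypothetical counts: 3 of 4 subgroup tumors correct, 1 of 4 others correct.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

For real analyses a maintained implementation such as `scipy.stats.fisher_exact` would usually be preferable; the version above only shows what the test computes.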
Fig. 5 Performance of the PM classifier on independent validation data. a Tumors of various types from COSMIC v70 (n = 1669). b Metastatic breast tumors from the SAFIR01 trial (n = 91). c Multiregion-sequenced non-small cell lung cancer (n = 9). See Fig. 3b legend. For comparison, the expected performance of our method in each data set was estimated according to the distribution of primary sites and the site-specific accuracies on test data
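One plausible reading of the expected-performance estimate in this legend is a weighted average: each site's test-set accuracy weighted by that site's share of the validation data set. The sketch below follows that reading, which is an assumption rather than the authors' stated formula; the site counts and accuracies are invented placeholder values.

```python
def expected_accuracy(site_counts, site_accuracy):
    """Expected overall accuracy for a data set with a given site mix.

    `site_counts` maps each primary site to its number of samples in the
    data set; `site_accuracy` maps each site to its accuracy on test data.
    """
    total = sum(site_counts.values())
    weighted = sum(n * site_accuracy[site] for site, n in site_counts.items())
    return weighted / total


# Hypothetical site mix and site-specific accuracies.
counts = {"Breast": 50, "Lung": 30, "Kidney": 20}
accuracy = {"Breast": 0.60, "Lung": 0.80, "Kidney": 0.90}

expected = expected_accuracy(counts, accuracy)
```

Under this reading, a data set dominated by a hard-to-classify site has a lower expected accuracy than a balanced one, which is why the comparison is made per data set.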
Fig. 6 Consistency of the PM classifier on data from multiple samples from the same tumor. The classifier was applied to 24 specimens from 9 NSCLC patients, including primary regions (R) and lymph node metastases (L). The proposed primary site is indicated by color along with the confidence score