Dan Søndergaard, Svend Nielsen, Christian N S Pedersen, Søren Besenbacher.
Abstract
A cancer of unknown primary (CUP) is a metastatic cancer for which standard diagnostic tests fail to identify the location of the primary tumor. CUPs account for 3-5% of cancer cases. Using molecular data to determine the location of the primary tumor in such cases can help doctors make the right treatment choice and thus improve the clinical outcome. In this paper, we present a new method for predicting the location of the primary tumor using gene expression data: locating cancers of unknown primary (LoCUP). The method models the data as a mixture of normal and tumor cells and thus allows correct classification even in impure samples, where the tumor biopsy is contaminated by a large fraction of normal cells. We find that our method provides a significant increase in classification accuracy (95.8% over 90.8%) on simulated low-purity metastatic samples and shows potential on a small dataset of real metastasis samples with known origin.
Keywords: cancer of unknown origin; classification; precision medicine; transcriptomics
Year: 2017 PMID: 28686574 PMCID: PMC6042823 DOI: 10.1515/jib-2017-0013
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
Figure 1: Example of the model in a 2-dimensional space using simulated data. The normal tissue samples follow an ordinary normal distribution with the “Normal Centroid” as its mean. The tumor samples form an elongated shape because impurity drags them towards the normal tissue centroid.
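The geometry described in the Figure 1 caption can be reproduced with a small sketch. It assumes, as a simplification, that an impure sample is a linear blend of a pure tumor profile and the normal centroid, weighted by the tumor purity α. The centroid coordinates, purity range, and noise level below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random

def mix(tumor, normal, alpha):
    """Blend a pure tumor profile with a normal profile; alpha is the tumor purity in [0, 1]."""
    return [alpha * t + (1.0 - alpha) * n for t, n in zip(tumor, normal)]

def distance(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical 2-D centroids (illustrative values only).
NORMAL_CENTROID = [0.0, 0.0]
TUMOR_CENTROID = [4.0, 3.0]

def simulate_tumor_samples(n, seed=0):
    """Draw n impure tumor samples as (alpha, point) pairs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        alpha = rng.uniform(0.2, 1.0)  # assumed purity range
        pure = [c + rng.gauss(0.0, 0.1) for c in TUMOR_CENTROID]
        samples.append((alpha, mix(pure, NORMAL_CENTROID, alpha)))
    return samples

if __name__ == "__main__":
    for alpha, point in sorted(simulate_tumor_samples(5)):
        d = distance(point, NORMAL_CENTROID)
        print(f"alpha={alpha:.2f}  distance to normal centroid={d:.2f}")
```

Under this assumption, samples with lower α land closer to the normal centroid, which produces the elongated tumor cloud along the line between the two centroids.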
Figure 2: Distribution of estimated purities and the fitted beta distributions. Estimated shape parameters are shown as Beta(β1, β2). We observe that a beta distribution is a good fit for the tumor purity estimates.
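As an illustration of how two shape parameters can be obtained from a set of purity estimates, the sketch below uses the method of moments. This is an assumed, swapped-in estimator: the caption does not say how the beta distributions were fitted, and the function name `fit_beta_moments` and the example data are hypothetical.

```python
import random
import statistics

def fit_beta_moments(purities):
    """Estimate Beta shape parameters (b1, b2) by the method of moments."""
    m = statistics.mean(purities)
    v = statistics.variance(purities)  # sample variance
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("purities are not compatible with a Beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical purity estimates drawn from Beta(5, 2).
    purities = [rng.betavariate(5.0, 2.0) for _ in range(20000)]
    b1, b2 = fit_beta_moments(purities)
    print(f"Beta({b1:.2f}, {b2:.2f})")
```

A maximum-likelihood fit would be an equally reasonable choice; the method of moments is used here only because it needs nothing beyond the sample mean and variance.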
Results for the primary (P) and simulated (S) experiments for both methods.
| Experiment | Method | CV accuracy (%) | Validation accuracy (%) | Dimensionality reduction | Number of components | Regularization factor |
|---|---|---|---|---|---|---|
| P | LoCUP | 94.9 | 95.2 | LDA | 221 | 819.2 |
| P | MLRR | 96.4 | 97.2 | LDA | 15 | 0.05 |
| S | LoCUP | 96.3 | 95.5 | LDA | 55 | 102.4 |
| S | MLRR | 91.1 | 90.8 | LDA | 105 | 0.1 |
The best parameters were found through a grid search for each experiment and method. On simulated metastatic data, our method clearly outperforms the MLRR method. Note that in some cases we obtain a higher accuracy on the validation data because more training data is available.
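A grid search of this kind can be sketched generically as follows. The grid values are assumptions: the four reported regularization factors (0.05, 0.1, 102.4, 819.2) all have the form 0.05·2^k, which is consistent with a doubling grid, but the actual grid is not stated here. The scoring function `toy_cv_score` is a hypothetical stand-in for a cross-validated accuracy and does not train any model.

```python
import itertools
import math

def grid_search(score_fn, grid):
    """Return (best_params, best_score) over the Cartesian product of the grid."""
    names = list(grid)
    best_params, best_score = None, -math.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Assumed grid: a few component counts and a doubling sequence of
# regularization factors starting at 0.05 (0.05, 0.1, ..., 819.2).
GRID = {
    "n_components": [15, 55, 105, 221],
    "regularization": [0.05 * 2 ** k for k in range(15)],
}

def toy_cv_score(n_components, regularization):
    """Hypothetical stand-in for cross-validated accuracy; peaks at (55, 102.4)."""
    return -abs(n_components - 55) - abs(math.log2(regularization / 102.4))

if __name__ == "__main__":
    print(grid_search(toy_cv_score, GRID))
```

In a real experiment, `score_fn` would fit the classifier on the training folds with the given parameters and return the mean held-out accuracy.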
Figure 3: Relationship between the datasets used for test (grid search and cross-validation) and validation. Simulated datasets are shown in blue. Datasets derived during cross-validation (only a single fold is shown) are shown in red.
Prediction on dataset of real metastatic samples with known primary tumor (D5).
| # | LoCUP 1st | LoCUP 2nd | LoCUP 3rd | MLRR 1st | MLRR 2nd | MLRR 3rd | Est. | True normal | True tumor |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | KICH | KIRP | | CESC | LUSC | 0.99 | LIHC | |
| 2 | | LIHC | CESC | | COAD | CESC | 0.50 | LUAD | |
| 3 | | COAD | UCEC | | COAD | CESC | 0.39 | LIHC | |
| 4 | | CESC | BLCA | | LIHC | CESC | 0.53 | LIHC | |
| 5 | | SKCM | BLCA | | CESC | BLCA | 0.99 | LIHC | |
| 6 | | CESC | BRCA | | CESC | BRCA | 0.96 | LIHC | |
| 7 | | CESC | BLCA | | LIHC | CESC | 0.50 | LIHC | |
| 8 | | CESC | UCEC | | KIRC | CESC | 0.82 | LIHC | |
Our method correctly predicts five of eight samples, while MLRR correctly predicts four of eight. Sample 2 is correctly predicted by LoCUP, while MLRR predicts LUAD. Note that the second-best scoring LoCUP prediction for sample 3 is also correct. However, MLRR also predicts samples 2 and 3 correctly when the second-best prediction is considered. Sample 1 may be a contaminated or mislabeled sample.
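Counting correct predictions at the first rank versus within the two best-scoring predictions amounts to a top-k tally. The sketch below shows that tally on hypothetical rankings; the labels are illustrative and are not the values from the table.

```python
def top_k_correct(rankings, truths, k=1):
    """Count samples whose true label appears among the first k ranked predictions."""
    return sum(truth in ranking[:k] for ranking, truth in zip(rankings, truths))

# Hypothetical ranked predictions (best first) and true primary sites.
rankings = [
    ["COAD", "LIHC", "CESC"],
    ["LIHC", "COAD", "UCEC"],
    ["KIRC", "CESC", "BLCA"],
]
truths = ["COAD", "COAD", "COAD"]

top1 = top_k_correct(rankings, truths, k=1)  # only the best prediction counts
top2 = top_k_correct(rankings, truths, k=2)  # best or second-best counts

if __name__ == "__main__":
    print(f"top-1: {top1} of {len(truths)}, top-2: {top2} of {len(truths)}")
```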
Figure 4: Accuracy for the LoCUP and MLRR methods binned by the true α of the simulated samples in D4. The number of samples in each bin is shown in bold. Our method outperforms MLRR on samples where α ∈ (0.2, 0.7), that is, low-purity samples.
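The binning behind a plot like Figure 4 can be sketched as below: samples are grouped by their true α, and the accuracy and sample count are reported per bin. The bin edges and example data are hypothetical, as is the function name `accuracy_by_purity_bin`.

```python
def accuracy_by_purity_bin(alphas, correct, edges):
    """Return {(low, high): (accuracy, n)} for half-open bins [low, high)."""
    result = {}
    for low, high in zip(edges[:-1], edges[1:]):
        hits = [c for a, c in zip(alphas, correct) if low <= a < high]
        if hits:  # skip empty bins
            result[(low, high)] = (sum(hits) / len(hits), len(hits))
    return result

# Hypothetical true purities and whether each sample was classified correctly.
alphas = [0.25, 0.35, 0.55, 0.65, 0.80, 0.95]
correct = [True, False, True, True, True, True]
edges = [0.2, 0.4, 0.6, 0.8, 1.0]  # assumed bin edges; alpha == 1.0 falls outside the last bin

if __name__ == "__main__":
    for (low, high), (acc, n) in accuracy_by_purity_bin(alphas, correct, edges).items():
        print(f"alpha in [{low}, {high}): accuracy={acc:.2f}, n={n}")
```

Running the same function on the predictions of two methods gives the per-bin comparison; the per-bin count matters because accuracy in sparsely populated bins is noisy.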