Iris H Wei, Yang Shi, Hui Jiang, Chandan Kumar-Sinha, Arul M Chinnaiyan.
Abstract
Metastatic cancer of unknown primary (CUP) accounts for up to 5% of all new cancer cases, with a 5-year survival rate of only 10%. Accurate identification of tissue of origin would allow for directed, personalized therapies to improve clinical outcomes. Our objective was to use transcriptome sequencing (RNA-Seq) to identify lineage-specific biomarker signatures for the cancer types that most commonly metastasize as CUP (colorectum, kidney, liver, lung, ovary, pancreas, prostate, and stomach). RNA-Seq data of 17,471 transcripts from a total of 3,244 cancer samples across 26 different tissue types were compiled from in-house sequencing data and publicly available International Cancer Genome Consortium and The Cancer Genome Atlas datasets. Robust cancer biomarker signatures were extracted using a 10-fold cross-validation method of log transformation, quantile normalization, transcript ranking by area under the receiver operating characteristic curve, and stepwise logistic regression. The entire algorithm was then repeated with a new set of randomly generated training and test sets, yielding highly concordant biomarker signatures. External validation of the cancer-specific signatures yielded high sensitivity (92.0% ± 3.15%; mean ± standard deviation) and specificity (97.7% ± 2.99%) for each cancer biomarker signature. The overall performance of this RNA-Seq biomarker-generating algorithm yielded an accuracy of 90.5%. In conclusion, we demonstrate a computational model for producing highly sensitive and specific cancer biomarker signatures from RNA-Seq data, generating signatures for the top eight cancer types responsible for CUP to accurately identify tumor origin.
Year: 2014 PMID: 25425966 PMCID: PMC4240918 DOI: 10.1016/j.neo.2014.09.007
Source DB: PubMed Journal: Neoplasia ISSN: 1476-5586 Impact factor: 5.715
Figure 1. Algorithm for extracting optimal cancer biomarker signatures from an RNA-Seq dataset. AUC, area under the receiver operating characteristic curve.
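The first three steps named in the abstract (log transformation, quantile normalization, and ranking transcripts by AUC) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the `log2(x + 1)` pseudocount, the tie handling in the normalization, the function names, and the toy counts are all assumptions made for illustration, and the stepwise logistic regression step is not shown.

```python
import math

def log_transform(matrix, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every value (the pseudocount is an assumption)."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same value distribution.

    matrix[t][s] is the expression of transcript t in sample s.
    Ties within a sample are resolved by sort order for simplicity.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(columns[c][orders[c][k]] for c in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(orders[c]):
            out[r][c] = reference[k]
    return out

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_transcripts(matrix, labels):
    """Return (transcript indices sorted by AUC, best first; per-transcript AUCs)."""
    aucs = [auc(row, labels) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda t: aucs[t], reverse=True)
    return order, aucs

# Toy read counts (made up): 3 transcripts x 4 samples.
counts = [
    [0, 1, 50, 60],    # high in the target cancer
    [30, 40, 35, 38],  # similar everywhere
    [90, 80, 2, 1],    # low in the target cancer
]
labels = [0, 0, 1, 1]  # 1 = sample belongs to the target cancer type
normalized = quantile_normalize(log_transform(counts))
order, aucs = rank_transcripts(normalized, labels)
print(order, aucs)
```

In this toy example the first transcript separates the two groups perfectly, so it ranks first and would be the first candidate offered to a downstream model.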
Allocation of Cancer Samples to RNA-Seq Training and Test Sets
| Cancer Type | All | Training Set | Test Set |
|---|---|---|---|
| Adrenal gland | 3 | 3 | 0 |
| Acute myeloid leukemia | 174 | 50 | 124 |
| Bladder | 70 | 50 | 20 |
| Breast | 864 | 50 | 814 |
| Cervix | 8 | 8 | 0 |
| Colorectum | 244 | 50 | 194 |
| Endometrium | 333 | 50 | 283 |
| Germ cell | 1 | 1 | 0 |
| Kidney | 24 | 24 | 0 |
| Liver | 15 | 15 | 0 |
| Head and neck | 263 | 50 | 213 |
| Lung | 348 | 50 | 298 |
| Lymphoma | 11 | 11 | 0 |
| Medulloblastoma | 1 | 1 | 0 |
| Melanoma | 136 | 50 | 86 |
| Merkel cell | 3 | 3 | 0 |
| Myeloproliferative neoplasm | 9 | 9 | 0 |
| Neuroblastoma | 2 | 2 | 0 |
| Neuroepithelioma | 1 | 1 | 0 |
| Oropharynx | 4 | 4 | 0 |
| Ovary | 418 | 50 | 368 |
| Pancreas | 76 | 50 | 26 |
| Prostate | 154 | 50 | 104 |
| Rhabdomyosarcoma | 1 | 1 | 0 |
| Salivary gland | 4 | 4 | 0 |
| Stomach | 77 | 50 | 27 |
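The counts in the table are consistent with a simple rule: each cancer type contributes at most 50 randomly chosen samples to the training set, and any remainder goes to the test set. That rule is inferred from the table rather than stated in the text, so the sketch below is only an illustration; the `allocate` function, the cap parameter, the seed, and the sample IDs are hypothetical.

```python
import random

def allocate(sample_ids, cap=50, seed=0):
    """Randomly send up to `cap` samples to training; the rest go to the test set."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled[:cap], shuffled[cap:]

# Cohort sizes for three rows of the table; the sample IDs are made up.
cohort_sizes = {"Colorectum": 244, "Kidney": 24, "Stomach": 77}
split = {}
for cancer, n in cohort_sizes.items():
    ids = [f"{cancer}-{i}" for i in range(n)]
    train, test = allocate(ids)
    split[cancer] = (len(train), len(test))

print(split)
```

Under this rule, cancer types with fewer than 50 samples (such as kidney and liver) end up with no test samples, which matters when reading the external validation results.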
Figure 2. Scatter plots of the 10-fold cross-validation method used to determine optimal biomarker signatures for 8 different cancer types. Points highlighted in red indicate the highest cross-validated AUC for each cancer type. AUC, area under the receiver operating characteristic curve.
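The selection step that Figure 2 summarizes, scoring candidate signatures by 10-fold cross-validation and keeping the one with the highest cross-validated AUC, might look roughly like the sketch below. The fold construction, the `fit_and_score` callback, the candidate labels, and the AUC values are all hypothetical; the source does not say how folds were drawn or what the candidates on each plot are.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_auc(n_samples, fit_and_score, k=10, seed=0):
    """Average the held-out AUC over k folds.

    fit_and_score(train_idx, test_idx) is a caller-supplied (hypothetical)
    function that fits a model on train_idx and returns its AUC on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, held_out in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train, held_out))
    return sum(scores) / k

def best_candidate(cv_auc):
    """Return the candidate with the highest cross-validated AUC (first one on ties)."""
    return max(cv_auc, key=cv_auc.get)

# Made-up cross-validated AUCs for three candidate signatures.
cv_auc = {"3 transcripts": 0.962, "4 transcripts": 0.981, "5 transcripts": 0.975}
chosen = best_candidate(cv_auc)
print(chosen)
```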
Cancer Biomarker Signatures Generated from Two Separate Randomizations
Transcripts highlighted in red are common between the two signatures.
Figure 3. Internal validation of eight cancer-specific biomarker signatures yields high area under the ROC curve values. The entire 688-sample RNA-Seq training set was used as the test set for each cancer signature. Dotted lines indicate lines of identity. Points of minimum distance to (0,1) are highlighted in red. ROC, receiver operating characteristic; AUC, area under the ROC curve.
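The highlighted points in Figure 3 are the ROC points closest to the ideal corner (false positive rate 0, true positive rate 1). A minimal sketch of how such a point could be located is given below; the scores, labels, and function names are made up for illustration, and the "positive when score >= threshold" convention is an assumption.

```python
import math

def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) for each distinct score used as a cutoff.

    A sample is called positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

def closest_to_perfect(points):
    """Pick the ROC point with minimum Euclidean distance to (FPR, TPR) = (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))

# Made-up predicted values for 4 positive and 5 negative samples.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]
threshold, fpr, tpr = closest_to_perfect(roc_points(scores, labels))
print(threshold, fpr, tpr)
```

Here the cutoff of 0.60 captures all four positives at the cost of one false positive, which lies closer to (0,1) than the stricter cutoff of 0.80 that misses one positive.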
External Validation of Eight Cancer-Specific Biomarker Signatures Using 2,556-Sample RNA-Seq Test Set
| Outcome | Colorectum | Kidney | Liver | Lung | Ovary | Pancreas | Prostate | Stomach |
|---|---|---|---|---|---|---|---|---|
| TP | 174 | 0 | 0 | 274 | 360 | 24 | 95 | 24 |
| TN | 2355 | 2533 | 2552 | 2222 | 2188 | 2389 | 2424 | 2305 |
| FP | 7 | 23 | 4 | 36 | 0 | 141 | 28 | 224 |
| FN | 20 | 0 | 0 | 24 | 8 | 2 | 9 | 3 |
TP, true positive; TN, true negative; FP, false positive; FN, false negative.
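Per-signature sensitivity and specificity follow directly from the counts in this table. The sketch below recomputes them; the dictionary layout and function names are illustrative choices. Sensitivity is undefined for kidney and liver because the test set contains no samples of those types (TP + FN = 0), so they are left out of the sensitivity mean. The resulting means round to the 92.0% sensitivity and 97.7% specificity quoted in the abstract.

```python
from statistics import mean

# TP, TN, FP, FN per signature, copied from the table above.
counts = {
    "Colorectum": dict(TP=174, TN=2355, FP=7, FN=20),
    "Kidney": dict(TP=0, TN=2533, FP=23, FN=0),
    "Liver": dict(TP=0, TN=2552, FP=4, FN=0),
    "Lung": dict(TP=274, TN=2222, FP=36, FN=24),
    "Ovary": dict(TP=360, TN=2188, FP=0, FN=8),
    "Pancreas": dict(TP=24, TN=2389, FP=141, FN=2),
    "Prostate": dict(TP=95, TN=2424, FP=28, FN=9),
    "Stomach": dict(TP=24, TN=2305, FP=224, FN=3),
}

def sensitivity(c):
    """TP / (TP + FN) as a percentage, or None when there are no positives."""
    positives = c["TP"] + c["FN"]
    return 100.0 * c["TP"] / positives if positives else None

def specificity(c):
    """TN / (TN + FP) as a percentage."""
    return 100.0 * c["TN"] / (c["TN"] + c["FP"])

sens = {name: sensitivity(c) for name, c in counts.items()}
spec = {name: specificity(c) for name, c in counts.items()}
mean_sens = mean(v for v in sens.values() if v is not None)
mean_spec = mean(spec.values())
print(round(mean_sens, 1), round(mean_spec, 1))
```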
Figure 4. RNA-Seq heat map of 8 cancer-specific biomarker signatures (rows) across all 3,244 cancer samples (columns).
Overall Performance of RNA-Seq Biomarker Generating Algorithm for Predicting Tissue of Origin in 2,556 Cancer Samples
| Outcome | Samples | Metric | % |
|---|---|---|---|
| TP | 946 | Sensitivity | 95.0 |
| TN | 1366 | Specificity | 87.6 |
| FP | 194 | PPV | 83.0 |
| FN | 50 | NPV | 96.5 |
| | | Accuracy | 90.5 |
Samples with duplicate cancer predictions were assigned the identity with the highest predicted value. TP, true positive; TN, true negative; FP, false positive; FN, false negative; PPV, positive predictive value; NPV, negative predictive value.
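The percentages in this table can be recomputed from the four counts, and the tie-breaking rule in the footnote amounts to taking the signature with the highest predicted value. The sketch below shows both; the function names, the 0.5 cutoff for calling a signature positive, and the example predicted values are assumptions, not details given in the source.

```python
def overall_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix summaries, as percentages."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (tn + fn),
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
    }

def assign_identity(predicted, cutoff=0.5):
    """Among signatures called positive, keep the one with the highest predicted value.

    `predicted` maps cancer type -> predicted value; the cutoff is hypothetical.
    Returns None when no signature calls the sample positive.
    """
    positives = {cancer: p for cancer, p in predicted.items() if p >= cutoff}
    return max(positives, key=positives.get) if positives else None

# Counts from the table above.
metrics = {k: round(v, 1) for k, v in overall_metrics(946, 1366, 194, 50).items()}
print(metrics)

# A made-up sample that two signatures both call positive.
call = assign_identity({"Pancreas": 0.81, "Stomach": 0.93, "Lung": 0.12})
print(call)
```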