Kanggeun Lee, Hyoung-Oh Jeong, Semin Lee, Won-Ki Jeong.
Abstract
With recent advances in DNA sequencing technologies, fast acquisition of large-scale genomic data has become commonplace. For cancer studies in particular, there is an increasing need to classify cancer type based on somatic alterations detected from sequencing analyses. However, the ever-increasing size and complexity of the data make the classification task extremely challenging. In this study, we evaluate the contributions of various input features that can be derived from genomic data, such as mutation profiles, mutation rates, mutation spectra and signatures, and somatic copy number alterations, and further utilize them for accurate cancer type classification. We introduce a novel ensemble of machine learning classifiers, called CPEM (Cancer Predictor using an Ensemble Model), which is tested on 7,002 samples representing 31 different cancer types collected from The Cancer Genome Atlas (TCGA) database. We first systematically examined the impact of the input features. Features known to be associated with specific cancers had relatively high importance in our initial prediction model. We further investigated various machine learning classifiers and feature selection methods to derive the ensemble-based cancer type prediction model, which achieves up to 84% classification accuracy in nested 10-fold cross-validation. Finally, we narrowed down the target cancers to the six most common types and achieved up to 94% accuracy.
Year: 2019 PMID: 31729414 PMCID: PMC6858312 DOI: 10.1038/s41598-019-53034-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the CPEM cancer type prediction workflow. The first step in our workflow is building a set of feature vectors from the TCGA database. The feature vector consists of 22,421 gene-level mutation profiles, two mutation rates, six mutation spectra, 13,040 gene-level copy number alterations, and 96 mutation signatures. The next step is reducing the dimension of the feature vector by linear support vector classifier (LSVC)-based feature selection. In the final step, two machine learning classifiers trained on the same selected feature set (a deep neural network and a random forest) are combined to build an ensemble model for the final cancer type prediction.
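The group sizes listed in the Figure 1 caption determine the length of the feature vector before selection. A minimal sketch of that bookkeeping follows; the concatenation order and the names `FEATURE_GROUPS` and `group_slices` are assumptions made for illustration and are not taken from the CPEM implementation.

```python
# Feature-group sizes as stated in the Figure 1 caption.
# The concatenation order below is an assumption for illustration.
FEATURE_GROUPS = [
    ("mutation_profiles", 22_421),
    ("mutation_rates", 2),
    ("mutation_spectra", 6),
    ("copy_number_alterations", 13_040),
    ("mutation_signatures", 96),
]


def group_slices(groups):
    """Map each group name to the slice it occupies in the concatenated vector."""
    slices, start = {}, 0
    for name, size in groups:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


slices, total_dim = group_slices(FEATURE_GROUPS)
print(total_dim)  # sum of the five group sizes
```

Keeping the slices around makes it straightforward to drop or combine whole feature groups, as in the per-group comparison of Figure 3(a).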
Figure 2. Overview of the training and testing process of CPEM. We employ nested 10-fold cross-validation, where the optimal feature selection is performed using the inner 10-fold cross-validation while the accuracy of the proposed method is evaluated using the outer 10-fold cross-validation. In the training step, two machine learning classifiers are trained with the same feature set chosen by the optimal feature selection. Note that in every iteration of the outer cross-validation, the testing data is never used in the feature selection optimization step.
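The nesting described in the Figure 2 caption can be sketched with plain index lists. This is an illustrative simplification, not the authors' code: the function names are hypothetical, the folds are shuffled but not stratified by cancer type, and no model is trained.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle `indices` and split them into k near-equal, disjoint folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=10, k_inner=10, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train, so outer_test never influences feature selection.
    """
    outer_folds = k_fold_indices(range(n_samples), k_outer, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [
            idx for j, fold in enumerate(outer_folds) if j != i for idx in fold
        ]
        inner_folds = k_fold_indices(outer_train, k_inner, seed + i + 1)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [
                idx for n, fold in enumerate(inner_folds) if n != m for idx in fold
            ]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Small toy run; the sample count here is arbitrary.
splits = list(nested_cv_splits(n_samples=200))
```

Feature selection settings would be chosen using only `inner_splits`, and the resulting model would then be scored once on `outer_test`.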
Figure 3. (a) Classification accuracy of a random forest classifier on each combination of feature groups. MP: Gene-level Mutation Profiles, Rate: Mutation Rates, Spec.: Mutation Spectra, SCNAs: Somatic Copy Number Alterations, Sig.: Mutation Signatures. (b) Comparison of feature selection methods. The boxplot shows the classifier accuracy measured by the inner 10-fold cross-validation for each feature selection method. LSVC feature selection performs best on all three classifiers. (c) Classification accuracy for various feature sizes generated by the LSVC method. Accuracy, measured by the inner 10-fold cross-validation used to optimize the DNN and random forest, peaks at around 4,000 to 8,000 features.
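To illustrate how a linear classifier's weights can yield feature sets of different sizes, as in panel (c), the sketch below ranks features by the summed absolute weight across classes and keeps the top k. This ranking rule, the function name `select_top_k`, and the toy weight matrix are assumptions for illustration; the source does not state how CPEM derives each feature size from the LSVC.

```python
def select_top_k(coef, k):
    """Return indices of the k features with the largest summed |weight| across classes."""
    n_features = len(coef[0])
    importance = [sum(abs(row[j]) for row in coef) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical 3-class x 5-feature weight matrix (not values from the study).
coef = [
    [0.0, 1.2, 0.0, -0.3, 0.0],
    [0.0, -0.8, 0.1, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.9, 0.0],
]
selected = select_top_k(coef, k=2)
```

Sweeping `k` and scoring each resulting feature set on the inner folds gives a curve of the kind described in panel (c).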
Figure 4. Comparison of CPEM and other conventional machine learning classifiers. The boxplot shows the accuracy measured by the outer 10-fold cross-validation for each machine learning model after optimization through inner 10-fold cross-validation. The average classification accuracy of CPEM is about 6 and 11 percentage points higher than those of a fully-connected deep neural network and a random forest, respectively.
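One common way to combine two classifiers into an ensemble is to average their per-class probabilities and predict the class with the highest average. The sketch below shows that rule only as an assumption; this record does not specify how CPEM merges the outputs of the deep neural network and the random forest, and the probabilities are toy values.

```python
def ensemble_predict(probs_a, probs_b, weight_a=0.5):
    """Average two per-class probability vectors and return (label_index, averaged)."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both classifiers must score the same classes")
    averaged = [
        weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)
    ]
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return best, averaged


# Toy probabilities over three hypothetical classes (not data from the study).
dnn_probs = [0.50, 0.40, 0.10]
rf_probs = [0.20, 0.70, 0.10]
label, averaged = ensemble_predict(dnn_probs, rf_probs)
```

In this toy case the two models disagree, and the averaged scores side with the more confident one.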
Cancer type, sample size, number of mutated genes, number of copy number-altered genes, average precision, average recall, average F1 score and average classification accuracy of 31 cancer types used in our experiment.
| Cancer type | Sample size | # of mutated genes | # of copy number altered genes | Precision (%) | Recall (%) | F1 score (%) | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| Adrenocortical carcinoma (ACC) | 88 | 6,904 | 12,725 | 96.10 | 84.09 | 89.70 | 84.09 |
| Bladder urothelial carcinoma (BLCA) | 127 | 11,674 | 11,223 | 92.04 | 63.78 | 75.35 | 62.20 |
| Breast invasive carcinoma (BRCA) | 967 | 15,716 | 12,178 | 83.19 | 92.14 | 87.44 | 92.86 |
| Cervical and endocervical cancers (CESC) | 191 | 12,165 | 11,505 | 69.63 | 68.06 | 68.84 | 71.89 |
| Cholangiocarcinoma (CHOL) | 35 | 3,511 | 9,707 | 73.68 | 40.00 | 51.85 | 31.43 |
| Colorectal adenocarcinoma (COADREAD) | 220 | 14,941 | 11,509 | 87.73 | 87.73 | 87.73 | 88.18 |
| Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | 48 | 6,457 | 9,475 | 90.91 | 62.50 | 74.07 | 58.33 |
| Esophageal carcinoma (ESCA) | 184 | 13,557 | 11,534 | 81.18 | 58.70 | 68.35 | 63.04 |
| Glioblastoma multiforme (GBM) | 280 | 8,069 | 11,410 | 81.85 | 88.57 | 85.08 | 89.64 |
| Head and Neck squamous cell carcinoma (HNSC) | 279 | 12,783 | 11,270 | 70.65 | 74.19 | 72.38 | 77.42 |
| Kidney Chromophobe (KICH) | 66 | 3,698 | 10,431 | 96.49 | 83.33 | 89.43 | 83.33 |
| Kidney renal clear cell carcinoma (KIRC) | 410 | 9,900 | 11,562 | 88.36 | 94.39 | 91.27 | 94.39 |
| Kidney renal papillary cell carcinoma (KIRP) | 161 | 7,332 | 10,568 | 83.23 | 80.12 | 81.65 | 78.88 |
| Acute Myeloid Leukemia (LAML) | 182 | 1,369 | 7,719 | 84.50 | 92.86 | 88.48 | 93.41 |
| Brain Lower Grade Glioma (LGG) | 280 | 4,732 | 10,190 | 89.88 | 82.50 | 86.03 | 82.86 |
| Liver hepatocellular carcinoma (LIHC) | 193 | 9,985 | 11,980 | 81.59 | 68.91 | 74.72 | 66.84 |
| Lung adenocarcinoma (LUAD) | 230 | 13,931 | 11,071 | 86.51 | 80.87 | 83.60 | 79.57 |
| Lung squamous cell carcinoma (LUSC) | 177 | 13,487 | 11,212 | 73.33 | 74.58 | 73.95 | 78.53 |
| Ovarian serous cystadenocarcinoma (OV) | 311 | 8,435 | 11,684 | 81.69 | 90.35 | 85.80 | 91.96 |
| Pancreatic adenocarcinoma (PAAD) | 149 | 10,144 | 8,744 | 83.23 | 89.93 | 86.45 | 89.93 |
| Pheochromocytoma and Paraganglioma (PCPG) | 166 | 2,125 | 10,103 | 86.78 | 90.96 | 88.82 | 89.16 |
| Prostate adenocarcinoma (PRAD) | 331 | 6,116 | 9,996 | 80.16 | 90.33 | 84.94 | 87.61 |
| Sarcoma (SARC) | 243 | 7,416 | 12,099 | 86.00 | 70.78 | 77.65 | 69.14 |
| Skin Cutaneous Melanoma (SKCM) | 341 | 17,085 | 11,883 | 90.05 | 92.67 | 91.54 | 92.67 |
| Stomach adenocarcinoma (STAD) | 286 | 16,607 | 11,745 | 76.09 | 79.02 | 77.53 | 80.77 |
| Testicular Germ Cell Tumors (TGCT) | 155 | 5,904 | 9,133 | 91.02 | 98.07 | 94.41 | 98.71 |
| Thyroid carcinoma (THCA) | 403 | 3,910 | 8,721 | 85.14 | 93.80 | 89.26 | 93.80 |
| Thymoma (THYM) | 121 | 1,805 | 9,305 | 86.14 | 71.90 | 78.38 | 76.03 |
| Uterine Corpus Endometrial Carcinoma (UCEC) | 242 | 18,412 | 12,133 | 81.39 | 77.69 | 79.49 | 76.86 |
| Uterine Carcinosarcoma (UCS) | 56 | 5,508 | 11,611 | 67.74 | 37.50 | 48.28 | 35.71 |
| Uveal Melanoma (UVM) | 80 | 1,340 | 9,559 | 80.00 | 85.33 | 82.58 | 83.75 |
| Total | 7,002 | 22,421 | 13,040 | 83.43 | 78.89 | 80.49 | 84.09 |
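The per-type F1 scores in the table above are consistent, to within rounding, with the harmonic mean of the reported precision and recall. A minimal check on three rows follows; the helper name `f1_score` is an illustrative choice, and the values are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the table above.
rows = {
    "ACC": (96.10, 84.09, 89.70),
    "BLCA": (92.04, 63.78, 75.35),
    "COADREAD": (87.73, 87.73, 87.73),
}
for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

The Total row does not follow the same identity (the formula gives about 81.10 against the reported 80.49), so it appears to be aggregated differently from the per-type rows.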
Classification accuracy of various machine learning classifiers and their combinations with LSVC feature selection in the outer 10-fold cross-validation.
| Classifier pair | A | B | A ∩ B | A ∪ B | A−B | B−A | Ensemble (A, B) |
|---|---|---|---|---|---|---|---|
| DNN (A), Random forest (B) | 82.25 | 73.79 | 67.81 | 88.23 | 14.44 | 5.98 | |
| DNN (A), OvR SVM (B) | 82.25 | 72.85 | 70.21 | 84.89 | 12.04 | 2.64 | 80.95 |
| DNN (A), KNN (B) | 82.25 | 48.59 | 46.30 | 84.53 | 35.95 | 2.29 | 78.81 |
| Random forest (A), OvR SVM (B) | 73.79 | 72.85 | 62.58 | 84.06 | 11.21 | 10.27 | |
| Random forest (A), KNN (B) | 73.79 | 48.59 | 43.87 | 78.51 | 29.92 | 4.71 | 67.30 |
| OvR SVM (A), KNN (B) | 72.85 | 48.59 | 44.50 | 76.94 | 28.35 | 4.08 | 68.41 |
This result confirms that the ensemble of a deep neural network and a random forest performs best.
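The overlap columns in the table above behave like set sizes: the numbers are consistent with A ∪ B = A + B − (A ∩ B), A−B = A − (A ∩ B), and B−A = B − (A ∩ B), read as percentages of samples classified correctly by either, only the first, or only the second classifier. That reading is inferred from the values rather than stated in this record. A short check on the first row:

```python
def overlap_stats(acc_a, acc_b, acc_both):
    """Derive the union and the two set differences from A, B and A ∩ B (all in %)."""
    return {
        "A ∪ B": round(acc_a + acc_b - acc_both, 2),
        "A − B": round(acc_a - acc_both, 2),
        "B − A": round(acc_b - acc_both, 2),
    }


# DNN (A) and random forest (B) row from the table above.
stats = overlap_stats(82.25, 73.79, 67.81)
print(stats)
```

Under this reading, A ∪ B is the share of samples that at least one of the two classifiers gets right, which bounds what a combination that only chooses between their two predictions could reach.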
Comparison between CPEM and previously reported cancer type classification methods.
| Method | # of samples | # of features | # of cancer types | Precision (%) | Recall (%) | F1 score (%) | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| DeepGene | 3122 | 1200 | 12 | N/A | N/A | N/A | 66.50 |
| *TumorTracer | 2820 | 232 | 6 | 85.83 | 84.95 | 85.39 | 85.00 |
| TumorTracer | 4975 | 560 | 10 | 72.23 | 68.98 | 70.57 | 69.00 |
| Chen | 6751 | 101176 | 18 | 65.24 | 62.26 | 63.72 | 62.00 |
| *CPEM | 2763 | 4000 | 6 | 84.77 | 90.61 | 87.59 | 94.14 |
| CPEM | 4823 | 4000 | 14 | 83.48 | 85.36 | 84.41 | 87.02 |
| CPEM | 7002 | 4000 | 31 | 83.43 | 78.89 | 80.49 | 84.06 |