Yingshuai Sun, Sitao Zhu, Kailong Ma, Weiqing Liu, Yao Yue, Gang Hu, Huifang Lu, Wenbin Chen.
Abstract
Cancer is a major cause of death worldwide, and early diagnosis is required for a favorable prognosis. Histological examination is the gold standard for cancer identification; however, a large amount of inter-observer variability exists in histological diagnosis. Numerous studies have shown that cancer genesis is accompanied by an accumulation of harmful mutations, which makes it possible to identify cancer from genomic information. We propose a method, GDL (genome deep learning), to study the relationship between genomic variations and traits based on deep neural networks. We analyzed WES (whole-exome sequencing) mutation files of 6,083 samples from 12 cancer types obtained from TCGA (The Cancer Genome Atlas) and WES data of 1,991 healthy samples from the 1000 Genomes Project. Based on GDL, we constructed 12 specific models, each distinguishing one type of cancer from healthy tissue; a total-specific model that distinguishes healthy from cancer tissue; and a mixture model that distinguishes among all 12 types of cancer. The accuracies of the specific, mixture and total-specific models are 97.47%, 70.08% and 94.70%, respectively, for cancer identification. We developed an efficient method for the identification of cancer based on genomic information that offers a new direction for disease diagnosis.
Year: 2019 PMID: 31754222 PMCID: PMC6872744 DOI: 10.1038/s41598-019-53989-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The architecture of genomic deep learning (GDL). The Mutation Collection is used as the reference. Point mutations transform the data and labels through the Sn rule.
Summary information of datasets from the TCGA and the 1000 Genome Project that were used in this study.
| Cancer type | Samples (N) | SNVs files (N) | Age (mean ± s.d.) | Gender: Male (%) | Gender: Female (%) | Race: White (N) | Race: American (N) | Race: Asian (N) | Race: NA (N) | Tumor stage: I (N) | Tumor stage: II (N) | Tumor stage: III (N) | Tumor stage: IV (N) | Tumor stage: NA (N) | Vital status: Alive (N) | Vital status: Deceased (N) | Vital status: NA (N) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLCA | 412 | 425 | 73.1 ± 10.5 | 73.79 | 26.21 | 327 | 23 | 44 | 18 | 2 | 131 | 141 | 136 | 2 | 230 | 182 | 0 |
| BRCA | 1044 | 1080 | 67.0 ± 13.1 | 1.05 | 98.95 | 719 | 180 | 59 | 86 | 173 | 588 | 241 | 20 | 22 | 898 | 146 | 0 |
| COAD | 433 | 493 | 74.5 ± 13.6 | 51.97 | 48.03 | 212 | 59 | 11 | 151 | 90 | 166 | 118 | 46 | 13 | 332 | 99 | 2 |
| GBM | 396 | 498 | 63.1 ± 13.2 | 63.36 | 36.64 | 337 | 41 | 6 | 12 | 0 | 0 | 0 | 0 | 396 | 88 | 303 | 5 |
| KIRC | 339 | 376 | 69.1 ± 12.0 | 64.60 | 35.40 | 275 | 52 | 6 | 6 | 193 | 33 | 2 | 69 | 42 | 258 | 81 | 0 |
| LGG | 513 | 530 | 49.6 ± 12.8 | 55.27 | 44.73 | 472 | 22 | 8 | 11 | 0 | 0 | 0 | 0 | 513 | 386 | 126 | 1 |
| LUSC | 497 | 561 | 73.4 ± 9.1 | 73.84 | 26.16 | 348 | 30 | 9 | 110 | 242 | 160 | 84 | 7 | 4 | 279 | 218 | 0 |
| OV | 443 | 610 | 66.6 ± 11.7 | 0 | 100 | 376 | 31 | 14 | 22 | 0 | 0 | 0 | 0 | 443 | 188 | 253 | 2 |
| PRAD | 498 | 503 | 69.0 ± 7.1 | 100 | 0 | 147 | 7 | 2 | 342 | 0 | 0 | 0 | 0 | 498 | 488 | 10 | 0 |
| SKCM | 470 | 472 | 66.1 ± 14.9 | 61.70 | 38.30 | 447 | 1 | 12 | 10 | 77 | 140 | 185 | 23 | 45 | 249 | 221 | 0 |
| THCA | 496 | 504 | 55.5 ± 15.5 | 26.41 | 73.59 | 325 | 27 | 51 | 93 | 331 | 51 | 110 | 2 | 2 | 482 | 14 | 0 |
| UCEC | 542 | 561 | 72.2 ± 11.2 | 0 | 100 | 371 | 119 | 20 | 32 | 0 | 0 | 0 | 0 | 542 | 451 | 91 | 0 |
| IGSR | 1991 | 1991 | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | ∼ | |
Figure 2. Cancer identification performance of the 12 specific models and the mixture model. (a) Classification performance of the 12 specific models. Varying the decision threshold, with sensitivity on the abscissa and specificity on the ordinate, yields 12 ROC curves. All 12 curves show near-perfect classification, with an area under the ROC curve (AUC) greater than 96%. (b) Confusion matrix of the mixture model. The abscissa indicates the label, and the ordinate indicates the predicted cancer type. LUSC is prominent in the predictions, especially in the BLCA predictions, suggesting that many cancers are easily confused with LUSC. Cancers that are easily confused in model predictions may be similar in their genetic variations. (c) Top-N accuracy for different numbers of predictions. The abscissa indicates the number of predictions, and the ordinate indicates accuracy. The accuracy with a single prediction is 70.08%, and the accuracy with two predictions is 83.20%, which supports the practical application of the model.
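The top-N accuracy in panel (c) counts a sample as correct when its true cancer type is among the N highest-scoring predictions. A minimal sketch of that calculation follows; the function name `top_n_accuracy` and the toy scores are illustrative assumptions, not code or data from the study.

```python
def top_n_accuracy(scores, labels, n):
    """Fraction of samples whose true label is among the n highest-scoring classes.

    scores: list of per-class score lists, one list per sample (hypothetical input).
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, true_label in zip(scores, labels):
        # Rank class indices by descending score and keep the first n.
        ranked = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Toy scores for 4 samples over 3 made-up classes (not data from the study).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
toy_labels = [0, 2, 0, 0]

for n in (1, 2, 3):
    print(n, top_n_accuracy(toy_scores, toy_labels, n))
```

Top-N accuracy can only stay the same or increase as N grows, which is consistent with the step from 70.08% to 83.20% reported in the caption.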
Summary of the GDL model classification performances.
| Cancer type | Raw data: Cancer | Raw data: Health | Filter (all): Cancer | Filter (all): Health | Filter (train data): Cancer | Filter (train data): Health | Filter (test data): Cancer | Filter (test data): Health | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BLCA | 425 | 1991 | 417 | 216 | 341 | 165 | 76 | 51 | 98.43 | 98.68 | 98.04 |
| BRCA | 1080 | 1991 | 1073 | 586 | 856 | 471 | 217 | 115 | 98.19 | 97.70 | 99.13 |
| COAD | 493 | 1991 | 482 | 842 | 385 | 675 | 97 | 167 | 99.24 | 98.97 | 99.40 |
| GBM | 498 | 1991 | 478 | 435 | 385 | 345 | 93 | 90 | 97.81 | 96.77 | 98.89 |
| KIRC | 376 | 1991 | 372 | 189 | 288 | 161 | 84 | 28 | 100.00 | 100.00 | 100.00 |
| LGG | 530 | 1991 | 518 | 491 | 410 | 397 | 108 | 94 | 99.01 | 99.07 | 98.94 |
| LUSC | 561 | 1991 | 545 | 166 | 436 | 133 | 109 | 33 | 100.00 | 100.00 | 100.00 |
| OV | 610 | 1991 | 600 | 176 | 481 | 140 | 119 | 36 | 100.00 | 100.00 | 100.00 |
| PRAD | 503 | 1991 | 494 | 497 | 399 | 394 | 95 | 103 | 97.47 | 95.79 | 99.03 |
| SKCM | 472 | 1991 | 434 | 409 | 358 | 316 | 76 | 93 | 98.22 | 97.37 | 98.92 |
| THCA | 504 | 1991 | 503 | 241 | 405 | 190 | 98 | 51 | 97.99 | 97.96 | 98.04 |
| UCEC | 561 | 1991 | 549 | 1217 | 446 | 967 | 103 | 250 | 98.02 | 98.06 | 98.00 |
| TOTAL | 6613 | 1991 | 5733 | 1629 | 4585 | 1304 | 1148 | 325 | 94.70 | 97.30 | 85.54 |
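The three performance columns follow from the test-set confusion counts in the usual way: sensitivity is the fraction of cancer samples called cancer, specificity is the fraction of healthy samples called healthy, and accuracy is the fraction of all test samples called correctly. The sketch below illustrates this on the BLCA row. The table gives only the test-set sizes (76 cancer, 51 healthy); the split into 75 true positives and 50 true negatives is an assumption inferred from the reported percentages, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity as percentages (2 decimals)."""
    sensitivity = tp / (tp + fn)                 # cancer samples correctly identified
    specificity = tn / (tn + fp)                 # healthy samples correctly identified
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # all samples correctly identified
    return {
        "accuracy": round(100 * accuracy, 2),
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
    }


# BLCA test set: 76 cancer and 51 healthy samples (from the table).
# Assumed counts, inferred from the reported percentages: one missed
# cancer sample and one misclassified healthy sample.
blca = classification_metrics(tp=75, fn=1, tn=50, fp=1)
print(blca)  # {'accuracy': 98.43, 'sensitivity': 98.68, 'specificity': 98.04}
```

Because accuracy weights the two classes by their test-set sizes, a row with many more cancer than healthy samples (such as TOTAL, with 1148 versus 325) has an accuracy closer to its sensitivity than to its specificity.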
Figure 3. Mixed matrices of the same dimensions for different cancers. UCEC and COAD share the largest number of variant sites, followed by UCEC and BRCA. BRCA and COAD are relatively more common than other types of cancer.
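One simple way to count variant sites shared between cancer types is a pairwise set intersection. The sketch below assumes each cancer type is represented as a set of (chromosome, position) tuples; that representation, the function name, and the toy data are all hypothetical and are not taken from the study.

```python
from itertools import combinations


def shared_site_counts(sites_by_type):
    """Count variant sites shared by each pair of types.

    sites_by_type: dict mapping a type name to a set of (chromosome, position)
    tuples (an assumed representation).
    """
    counts = {}
    for a, b in combinations(sorted(sites_by_type), 2):
        # Set intersection keeps only sites present in both types.
        counts[(a, b)] = len(sites_by_type[a] & sites_by_type[b])
    return counts


# Made-up sites for three placeholder types (not data from the study).
toy_sites = {
    "type_A": {("chr1", 100), ("chr7", 700)},
    "type_B": {("chr1", 100), ("chr2", 200), ("chr5", 500)},
    "type_C": {("chr1", 100), ("chr2", 200), ("chr3", 300)},
}

counts = shared_site_counts(toy_sites)
most_similar = max(counts, key=counts.get)
print(counts)
print(most_similar)
```

The pair with the largest count plays the role that UCEC and COAD play in the caption.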