Steven Gore, Rajeev K Azad.
Abstract
BACKGROUND: Despite remarkable advances in cancer research, cancer remains one of the leading causes of death worldwide. Early detection of cancer and localization of its tissue of origin are key to effective treatment. Here, we leverage technological advances in machine learning or artificial intelligence to design a novel framework for cancer diagnostics. Our proposed framework detects cancers and their tissues of origin using a unified model of cancers encompassing 33 cancers represented in The Cancer Genome Atlas (TCGA). Our model exploits the learned features of different cancers reflected in the respective dysregulated epigenomes, which arise early in carcinogenesis and differ remarkably between different cancer types or subtypes, thus holding great promise for early cancer detection.
Keywords: Cancer; Deep learning; Metastatic cancer; Neural network
Year: 2022 PMID: 35698059 PMCID: PMC9195411 DOI: 10.1186/s12859-022-04783-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1. The CancerNet architecture. Methylation data are input to the encoder. The encoder is composed of two dense feedforward layers using the ReLU activation function. Output of the encoder is passed to the probabilistic layer, which passes its output to the classifier and generator/decoder. The classifier is two dense feedforward layers, the first with the ReLU activation function and the second with the softmax activation function. The decoder is two dense feedforward layers, the first using the ReLU activation and the second using the sigmoid activation
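The data flow described in this caption can be sketched as a single forward pass. This is a minimal illustration with random, untrained weights, not the authors' implementation. Only the latent dimension (100, as stated in the latent-space captions) and the class count (33 cancers plus NORM) come from this record; the feature count, hidden width, and all function and variable names are hypothetical. The caption does not say how the probabilistic layer works, so the sketch assumes a Gaussian reparameterized sample, as in a variational autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: only LATENT_DIM (100) and N_CLASSES (33 cancers + NORM)
# are taken from the text; N_FEATURES and HIDDEN are placeholders.
N_FEATURES, HIDDEN, LATENT_DIM, N_CLASSES = 512, 256, 100, 34

def dense(n_in, n_out):
    """Return (weights, bias) for one dense feedforward layer."""
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

params = {
    "enc1": dense(N_FEATURES, HIDDEN),
    "enc2": dense(HIDDEN, HIDDEN),
    "mu": dense(HIDDEN, LATENT_DIM),       # assumed Gaussian mean head
    "log_var": dense(HIDDEN, LATENT_DIM),  # assumed Gaussian log-variance head
    "cls1": dense(LATENT_DIM, HIDDEN),
    "cls2": dense(HIDDEN, N_CLASSES),
    "dec1": dense(LATENT_DIM, HIDDEN),
    "dec2": dense(HIDDEN, N_FEATURES),
}

def layer(name, x):
    w, b = params[name]
    return x @ w + b

def forward(x):
    # Encoder: two dense layers with ReLU.
    h = relu(layer("enc2", relu(layer("enc1", x))))
    # Probabilistic layer (assumption: reparameterized Gaussian sample).
    mu, log_var = layer("mu", h), layer("log_var", h)
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    # Classifier: ReLU layer, then softmax over the classes.
    probs = softmax(layer("cls2", relu(layer("cls1", z))))
    # Decoder: ReLU layer, then sigmoid, so reconstructions lie in [0, 1].
    recon = sigmoid(layer("dec2", relu(layer("dec1", z))))
    return z, probs, recon

x = rng.uniform(0.0, 1.0, (4, N_FEATURES))  # fake inputs scaled to [0, 1]
z, probs, recon = forward(x)
```

Here `probs` holds one probability per class for each sample and `recon` has the same shape as the input. Both heads read the same latent vector `z`, which is what lets the latent space be inspected separately from the classifier.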
CancerNet’s performance in detecting the tissue of origin of 33 cancers
| Class | Precision | Recall | F1 |
|---|---|---|---|
| ACC | 0.99925 | 0.999249 | 0.999227 |
| BLCA | 0.998525 | 0.998498 | 0.998507 |
| BRCA | 0.998498 | 0.998498 | 0.998498 |
| CESC | 0.997868 | 0.997748 | 0.997784 |
| CHOL | 0.998248 | 0.997748 | 0.997973 |
| COAD | 0.98915 | 0.989489 | 0.989229 |
| DLBC | 0.998949 | 0.998874 | 0.998903 |
| ESCA | 0.989043 | 0.988739 | 0.988885 |
| GBM | 0.991661 | 0.991742 | 0.991697 |
| HNSC | 0.991985 | 0.992117 | 0.99203 |
| KICH | 0.999625 | 0.999625 | 0.999618 |
| KIRC | 0.997748 | 0.997748 | 0.997748 |
| KIRP | 0.997793 | 0.997748 | 0.997765 |
| LAML | 0.999625 | 0.999625 | 0.999621 |
| LGG | 0.991407 | 0.991366 | 0.991386 |
| LIHC | 0.997935 | 0.997748 | 0.997795 |
| LUAD | 0.997125 | 0.996997 | 0.997033 |
| LUSC | 0.994668 | 0.994745 | 0.994648 |
| MESO | 0.998875 | 0.998874 | 0.998829 |
| OV | 0.997384 | 0.997372 | 0.997377 |
| PAAD | 0.997312 | 0.997372 | 0.997321 |
| PCPG | 0.999635 | 0.999625 | 0.999627 |
| PRAD | 0.996604 | 0.996622 | 0.996567 |
| READ | 0.993231 | 0.99024 | 0.991447 |
| SARC | 0.998109 | 0.998123 | 0.998115 |
| SKCM | 0.99817 | 0.998123 | 0.998138 |
| STAD | 0.993135 | 0.993243 | 0.993174 |
| TGCT | 1 | 1 | 1 |
| THCA | 0.99852 | 0.998498 | 0.998506 |
| THYM | 0.997281 | 0.997372 | 0.997296 |
| UCEC | 0.993519 | 0.993619 | 0.99353 |
| UCS | 0.996655 | 0.996246 | 0.996433 |
| UVM | 0.999625 | 0.999625 | 0.999619 |
| NORM | 0.987758 | 0.987613 | 0.987665 |
A normal class (NORM) is also included. Performance was assessed using precision, recall, and F1-measure.
ACC—Adrenocortical carcinoma, BLCA—Bladder urothelial carcinoma, BRCA—Breast invasive carcinoma, CESC—Cervical squamous cell carcinoma and endocervical adenocarcinoma, CHOL—Cholangiocarcinoma, COAD—Colon adenocarcinoma, DLBC—Lymphoid neoplasm diffuse large B-cell lymphoma, ESCA—Esophageal carcinoma, GBM—Glioblastoma multiforme, HNSC—Head and neck squamous cell carcinoma, KICH—Kidney chromophobe, KIRC—Kidney renal clear cell carcinoma, KIRP—Kidney renal papillary cell carcinoma, LAML—Acute myeloid leukemia, LGG—Brain lower grade glioma, LIHC—Liver hepatocellular carcinoma, LUAD—Lung adenocarcinoma, LUSC—Lung squamous cell carcinoma, MESO—Mesothelioma, OV—Ovarian serous cystadenocarcinoma, PAAD—Pancreatic adenocarcinoma, PCPG—Pheochromocytoma and paraganglioma, PRAD—Prostate adenocarcinoma, READ—Rectum adenocarcinoma, SARC—Sarcoma, SKCM—Skin cutaneous melanoma, STAD—Stomach adenocarcinoma, TGCT—Testicular germ cell tumors, THCA—Thyroid carcinoma, THYM—Thymoma, UCEC—Uterine corpus endometrial carcinoma, UCS—Uterine carcinosarcoma, UVM—Uveal melanoma, NORM—Normal (non-cancer)
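As a reminder of how the three metrics in the table relate, the sketch below derives per-class precision, recall, and F1 from a confusion matrix whose rows are true classes and whose columns are predicted classes. The function name and the toy counts are illustrative and not taken from the paper, and the exact averaging scheme behind the reported values is not stated in this record.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    metrics = {}
    n = len(labels)
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp                       # truly k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy counts for three classes (not data from the paper).
labels = ["COAD", "READ", "NORM"]
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
metrics = per_class_metrics(confusion, labels)
```

In this toy matrix, two COAD samples are predicted as READ, which lowers COAD recall and READ precision at the same time.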
Fig. 2. Misclassification rates for 4 cancer types selected to illustrate trends observed in CancerNet. A COAD misclassifies primarily to READ with fewer misclassifications in ESCA and STAD. B ESCA misclassifies to HNSC, LUSC and STAD. Lung misclassifications occur often among some sample types. C OV samples misclassify as the two uterine cancer types considered in CancerNet: UCEC and UCS. D LIHC misclassifies as CHOL, MESO, SKCM and NORM. Refer to Abbreviations for cancer types indicated
Fig. 3. Confusion matrix of TCGA primary tumor classification. Primary tumors across 33 TCGA cancer types were classified. The correct class is shown by the Y-axis and the predicted class is shown by the X-axis (refer to Abbreviations for different cancers indicated on the X-axis; normal is abbreviated NORM)
Fig. 4. Visualization of test samples in the latent space. t-SNE was used to reduce the latent space dimension from 100 to 2. Samples originating from the same tissue form cluster(s) and are close to sample groups of similar tissues. For abbreviations, refer to the full abbreviation list. Normal samples are abbreviated NORM and are displayed in gray
Fig. 5. Renal subtype latent space distribution. A Samples representing different renal subtypes, as determined by the TCGA analysis of renal cancers, are mapped onto the latent space. Clear separation of subtypes PRCC T1 and T2 and ChRCC indicates that the neural network has learned features for discriminating between these renal subtypes. B Separation of renal samples in the latent space by gender
Fig. 6. Gastric adenocarcinoma latent space distribution. Samples representing different gastric adenocarcinomas cluster in the latent space by A body site of tumor and B hypomethylation status
Fig. 7. The tenfold cross-validation accuracy of a binary SVM with a linear kernel trained on the 100-dimensional latent space for each body site (A) and methylation status (B) for gastric adenocarcinoma. The linear kernel is used to test the separability of each in the full 100-dimensional latent space. The high performance of these models indicates that the body sites and methylation statuses are not overlapping in the higher dimensional latent space even though it may appear so in lower dimension representations (Fig. 6)
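The separability test described in this caption can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the authors' pipeline: the latent vectors are synthetic stand-ins, the group sizes and mean offset are arbitrary, and the variable names are hypothetical. It shows the general pattern of one binary linear-kernel SVM scored by tenfold cross-validation, which would be repeated once per body site or methylation status.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for 100-dimensional latent vectors (not real data):
# two groups whose means differ slightly along every latent dimension.
n_per_group, latent_dim = 60, 100
group_a = rng.normal(0.0, 1.0, (n_per_group, latent_dim))
group_b = rng.normal(0.5, 1.0, (n_per_group, latent_dim))
X = np.vstack([group_a, group_b])
y = np.array([0] * n_per_group + [1] * n_per_group)

# Binary SVM with a linear kernel, scored by tenfold cross-validation.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
```

A linear kernel is a deliberate choice for this kind of check: high cross-validated accuracy then implies the two groups are close to linearly separable in the full latent space, even when a 2-D projection shows them overlapping.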
Fig. 8. Squamous cell carcinoma latent space distribution. Squamous cell carcinoma samples tend to cluster in the latent space by their tissues of origin and by A HPV status but not by B smoker status