Mayur Divate1, Aayush Tyagi2, Derek J Richard1,3, Prathosh A Prasad2,4, Harsha Gowda5,6,7, Shivashankar H Nagaraj1,3.
Abstract
Cancer tissue-of-origin specific biomarkers are needed for effective diagnosis, monitoring, and treatment of cancers. In this study, we analyzed transcriptomics data from 37 cancer types provided by The Cancer Genome Atlas (TCGA) to identify cancer tissue-of-origin specific gene expression signatures. We developed a deep neural network model to classify cancers based on gene expression data. The model achieved a predictive accuracy of >97% across cancer types, indicating the presence of distinct cancer tissue-of-origin specific gene expression signatures. We interpreted the model using Shapley additive explanations to identify specific gene signatures that significantly contributed to cancer-type classification. We evaluated the model and the validity of gene signatures using an independent test data set from the International Cancer Genome Consortium. In conclusion, we present a robust neural network model for accurate classification of cancers based on gene expression data and also provide a list of gene signatures that are valuable for developing biomarker panels for determining cancer tissue-of-origin. These gene signatures serve as valuable biomarkers for determining tissue-of-origin for cancers of unknown primary.
Keywords: cancer type prediction; deep learning; gene expression signatures; pan cancer
Year: 2022 PMID: 35267493 PMCID: PMC8909043 DOI: 10.3390/cancers14051185
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.639
Figure 1. Unsupervised clustering of The Cancer Genome Atlas (TCGA) RNA-seq data using t-distributed stochastic neighbor embedding (t-SNE). (A) t-SNE plot showing unsupervised clustering of TCGA transcriptome data without log transformation. (B) t-SNE plot showing unsupervised clustering of TCGA transcriptome data after log transformation.
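The log transformation contrasted in panels (A) and (B) can be illustrated with a minimal sketch. The caption does not state the base or pseudocount used, so `log2(x + 1)` is an assumption here, and the function name and sample values are hypothetical.

```python
import math

def log_transform(expression_values, pseudocount=1.0):
    """Apply log2(x + pseudocount) to each expression value.

    The base-2 logarithm and the pseudocount of 1 are assumptions;
    the source only says "log transformation".
    """
    return [math.log2(v + pseudocount) for v in expression_values]

# Hypothetical expression values spanning several orders of magnitude
raw = [0.0, 1.0, 3.0, 1023.0]
logged = log_transform(raw)
print(logged)  # [0.0, 1.0, 2.0, 10.0]
```

The transform compresses the dynamic range, so a handful of very highly expressed genes no longer dominates the pairwise distances that an embedding such as t-SNE is built from.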
Figure 2. Architecture of the deep neural network (DNN) model and its progress during training. (A) The DNN model consists of an input layer, five hidden layers and an output layer. The input layer takes fragments per kilobase per million mapped fragments (FPKM) values for 13,250 genes. The five hidden layers, with 500, 250, 125, 100 and 75 nodes, perform nonlinear transformations using the rectified linear unit (ReLU) activation function to distinguish between cancer types. The output layer consists of 37 nodes, each representing one of the cancer types. (B) Accuracy and (C) categorical cross-entropy (loss) of the DNN model on both the training and validation data sets.
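The layer sizes in panel (A) can be sketched as a plain-Python forward pass. This is an illustration only, not the authors' implementation: the softmax output, the He-style random initialization, the seed and all function names are assumptions, and the demonstration uses a scaled-down 20-feature input so it runs quickly.

```python
import math
import random

# Layer widths from the caption: 13,250 inputs, five hidden layers, 37 outputs
LAYER_SIZES = [13250, 500, 250, 125, 100, 75, 37]

def count_parameters(sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def init_layers(sizes, seed=0):
    """Random weights (He-style scaling, an assumption) and zero biases."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, x):
    """ReLU on hidden layers, softmax (assumed) on the output layer."""
    h = x
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        h = softmax(z) if idx == len(layers) - 1 else relu(z)
    return h

def categorical_cross_entropy(probs, true_class):
    """Loss for one sample whose true label is the index true_class."""
    return -math.log(max(probs[true_class], 1e-12))

total_params = count_parameters(LAYER_SIZES)
print(total_params)  # 6805112

# Scaled-down demo: same hidden and output layers, only 20 hypothetical inputs
demo_sizes = [20] + LAYER_SIZES[1:]
demo_layers = init_layers(demo_sizes)
demo_rng = random.Random(1)
demo_input = [demo_rng.random() for _ in range(20)]
probs = forward(demo_layers, demo_input)
loss = categorical_cross_entropy(probs, true_class=0)
```

Nearly all of the parameters sit in the first layer (13,250 × 500 weights), which is why the demonstration shrinks only the input width. In practice a network like this would be built and trained with a deep learning framework rather than nested lists.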
Figure 3. Performance of the DNN model on the test data set.
Figure 4. Workflow employed to identify gene expression signatures. A schematic showing the workflow employed to identify gene expression signatures using the deep learning model and SHAP values.
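SHAP values are based on Shapley values, which split a model's output among its input features. As a minimal sketch of that idea, the code below computes exact Shapley values for a hypothetical three-gene scoring function. The scoring function, the input values and the all-zero baseline are invented for illustration; a model with thousands of genes would need an approximation, since the exact computation grows factorially with the number of features.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features not yet "added" keep their baseline value. Every ordering
    of the features is enumerated, so this is only feasible for toy inputs.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # add feature i to the coalition
            output = model(current)
            phi[i] += output - previous_output  # marginal contribution
            previous_output = output
    return [p / len(orderings) for p in phi]

def toy_score(values):
    """Hypothetical class score with one gene-gene interaction term."""
    g1, g2, g3 = values
    return 2.0 * g1 + 0.5 * g2 + g1 * g3

x = [3.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, x, baseline)
print(phi)  # [9.0, 0.5, 3.0]
```

The three attributions sum to `toy_score(x) - toy_score(baseline)`, which is 12.5, and the interaction term `g1 * g3` is split equally between the two genes involved. Ranking genes by the magnitude of such attributions is one way a signature list could be derived, although the source does not spell out the exact procedure.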
Figure 5. Cancer tissue-of-origin specific gene signatures. Supervised hierarchical clustering showing gene expression signatures associated with different cancer types.
Figure 6. Representative examples of genes that exhibit cancer tissue-of-origin specific expression. Box-and-whisker plots showing expression of (A) KLK3, (B) GFAP, (C) NOX1 and (D) CRYGN across 37 cancer types in the TCGA data set.
Figure 7. Deepcap webtool for cancer tissue-of-origin prediction and visualizing results.