Tulika Kakati, Dhruba K Bhattacharyya, Jugal K Kalita, Trina M Norden-Krichmar.
Abstract
BACKGROUND: Traditional differential expression analysis on small datasets is prone to false positives and false negatives caused by sample variation. Given the recent advances in deep learning (DL) based models, we wanted to extend the state of the art in disease biomarker prediction from RNA-seq data using DL. However, applying DL to RNA-seq data is challenging because appropriate labels are absent and the number of samples is small compared to the number of genes. Deep learning coupled with transfer learning can improve prediction performance on novel data by incorporating patterns learned from other related data. As new disease datasets emerge, biomarker prediction would be facilitated by a generalized model that can transfer the knowledge of trained feature maps to a new dataset. To the best of our knowledge, there is no Convolutional Neural Network (CNN)-based model coupled with transfer learning that predicts the significant upregulating (UR) and downregulating (DR) genes from both trained and untrained datasets.
Keywords: Classification; Convolutional neural network; Differentially expressed genes; Disease biomarkers; Transfer learning
Year: 2022 PMID: 34991439 PMCID: PMC8734099 DOI: 10.1186/s12859-021-04527-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Workflow of DEGnext methodology. The workflow has three main phases. The first phase involves data collection, preprocessing, labelling, and splitting of the data. Here, we split the data into two parts: non-biologically validated data (“non-bio data” or P) and biologically validated data (“bio data” or Q). T1 is the non-biologically validated train data of P (“non-bio train data”, 80% of P). T2 is the non-biologically validated test data of P (“non-bio test data”, 20% of P). F1 is the fine-tune data of the biologically validated data (“fine-tune data”, 80% of Q). T3 is the biologically validated test data of Q (“bio-test data”, 20% of Q). The second phase includes training (first level training), fine-tuning (second level training), and testing of the CNN model to predict UR and DR genes. The third phase includes downstream enrichment analyses of the predicted UR and DR genes to identify potential biomarkers related to a cancer dataset. The CNN architecture is illustrated in Fig. 2.
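The split described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the shuffling seed, the placeholder gene identifiers, and the choice to round the 80% part down are all assumptions.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (80% part, 20% part).

    Assumption: the 80% part is rounded down to a whole number of genes.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]

def split_genes(gene_ids, bio_gene_ids, seed=0):
    """Partition genes into T1/T2 (from P) and F1/T3 (from Q)."""
    bio = set(bio_gene_ids)
    q = [g for g in gene_ids if g in bio]       # biologically validated (Q)
    p = [g for g in gene_ids if g not in bio]   # non-biologically validated (P)
    t1, t2 = split_80_20(p, seed)
    f1, t3 = split_80_20(q, seed)
    return {"T1": t1, "T2": t2, "F1": f1, "T3": t3}

# Placeholder identifiers sized like the BRCA row of the data table:
# 6514 filtered genes, of which 2327 are bio genes.
genes = [f"gene_{i}" for i in range(6514)]
parts = split_genes(genes, genes[:2327])
sizes = {name: len(part) for name, part in parts.items()}
print(sizes)  # {'T1': 3349, 'T2': 838, 'F1': 1861, 'T3': 466}
```

Under these assumptions the four sizes agree with the counts reported for BRCA in the table of gene and sample numbers.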
Fig. 2 CNN architecture of DEGnext. The input to the model is a 1D vector that represents one gene row of a cancer dataset. This 1D vector is converted to a 2D matrix with one channel using np.reshape(). We used a sequence of eight 2D convolutional neural network (CNN) layers with ReLU() as the activation function. Each CNN layer uses a kernel size of (3, 3), a stride of 1, and padding of 1. We used a 2D Maxpool layer with a kernel size of 2. To make the model accept any input size, we used a 2D AdaptiveMaxPool layer with a target output size of 1 × 1. The output of the CNN layers is fed to a sequence of 5 linear layers with ReLU() as the activation function. We applied Softmax() to the output of the linear layers to obtain the probability of each class in the range [0, 1].
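To see why this layer configuration tolerates different input sizes, the spatial dimensions can be traced with the standard output-size formula for convolution and pooling, out = floor((in + 2·padding − kernel) / stride) + 1. The sketch below is illustrative only: the 12 × 12 input size and the placement of the max-pool layer after the eight convolutions are assumptions, since the caption does not state them.

```python
def conv2d_out(size, kernel=3, stride=1, padding=1):
    """Output length along one axis of a 2D convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def maxpool2d_out(size, kernel=2, stride=None):
    """Output length along one axis of max pooling; stride defaults to kernel."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

def trace_shapes(height, width, n_conv=8):
    """Return the (height, width) after each stage of the assumed layer order."""
    shapes = [("input", (height, width))]
    h, w = height, width
    for i in range(n_conv):
        # kernel 3, stride 1, padding 1 leaves the spatial size unchanged
        h, w = conv2d_out(h), conv2d_out(w)
        shapes.append((f"conv{i + 1}", (h, w)))
    h, w = maxpool2d_out(h), maxpool2d_out(w)
    shapes.append(("maxpool", (h, w)))
    # adaptive max pooling with target size 1 x 1 always ends at (1, 1)
    shapes.append(("adaptive_maxpool", (1, 1)))
    return shapes

for name, shape in trace_shapes(12, 12):  # hypothetical 12 x 12 input
    print(name, shape)
```

Because the 3 × 3 convolutions with padding 1 preserve height and width and the adaptive pool always reduces to 1 × 1, the number of features passed to the linear layers depends only on the channel count of the last convolution, not on the input size.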
Dataset abbreviations for cancer datasets used in DEGnext
| Dataset abbreviation | Cancer type | Dataset abbreviation | Cancer type |
|---|---|---|---|
| BLCA | Bladder urothelial carcinoma | LIHC | Liver hepatocellular carcinoma |
| BRCA | Breast invasive carcinoma | LUAD | Lung adenocarcinoma |
| CHOL | Cholangiocarcinoma | LUSC | Lung squamous cell carcinoma |
| COAD | Colon adenocarcinoma | PRAD | Prostate adenocarcinoma |
| ESCA | Esophageal carcinoma | READ | Rectum adenocarcinoma |
| HNSC | Head and neck squamous cell carcinoma | STAD | Stomach adenocarcinoma |
| KICH | Kidney Chromophobe | THCA | Thyroid carcinoma |
| KIRC | Kidney renal clear cell carcinoma | UCEC | Uterine Corpus endometrial carcinoma |
| KIRP | Kidney renal papillary cell carcinoma | – | – |
Performance of DEGnext on bio-test data (T3) of all 17 datasets using general learning considering fivefold cross validation
| Dataset | Accuracy | Recall | Precision | F-measure | MCC |
|---|---|---|---|---|---|
| BLCA | 98.42 | 98.42 | 98.49 | 0.98 | 0.97 |
| BRCA | 98.80 | 98.80 | 98.83 | 0.99 | 0.98 |
| CHOL | 100.00 | 100.00 | 100.00 | 1.00 | 1.00 |
| COAD | 99.64 | 99.64 | 99.65 | 1.00 | 0.99 |
| ESCA | 97.95 | 97.95 | 98.10 | 0.98 | 0.96 |
| HNSC | 99.32 | 99.32 | 99.34 | 0.99 | 0.98 |
| KICH | 100.00 | 100.00 | 100.00 | 1.00 | 1.00 |
| KIRC | 99.78 | 99.78 | 99.78 | 1.00 | 1.00 |
| KIRP | 100.00 | 100.00 | 100.00 | 1.00 | 1.00 |
| LIHC | 95.93 | 95.93 | 96.23 | 0.96 | 0.85 |
| LUAD | 99.82 | 99.82 | 99.83 | 1.00 | 1.00 |
| LUSC | 99.88 | 99.88 | 99.88 | 1.00 | 1.00 |
| PRAD | 99.35 | 99.35 | 99.36 | 0.99 | 0.99 |
| READ | 95.39 | 95.39 | 96.54 | 0.95 | 0.92 |
| STAD | 96.89 | 96.89 | 97.06 | 0.97 | 0.93 |
| THCA | 99.87 | 99.87 | 99.87 | 1.00 | 1.00 |
| UCEC | 99.60 | 99.60 | 99.61 | 1.00 | 0.99 |
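The five measures in this table can be computed from a binary confusion matrix as sketched below. The function name and the example counts are hypothetical. The support-weighted averaging of recall, precision, and F-measure is an assumption: the source does not state the averaging scheme, but weighted recall is algebraically equal to accuracy, which is consistent with the identical Accuracy and Recall columns.

```python
import math

def _safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def weighted_binary_metrics(tp, fp, fn, tn):
    """Accuracy, support-weighted recall/precision/F-measure, and MCC."""
    total = tp + fp + fn + tn
    support = {"pos": tp + fn, "neg": tn + fp}
    recall = {"pos": _safe_div(tp, tp + fn), "neg": _safe_div(tn, tn + fp)}
    precision = {"pos": _safe_div(tp, tp + fp), "neg": _safe_div(tn, tn + fn)}
    f1 = {
        c: _safe_div(2 * precision[c] * recall[c], precision[c] + recall[c])
        for c in support
    }

    def weighted(per_class):
        # average the per-class scores, weighting each class by its support
        return _safe_div(sum(support[c] * per_class[c] for c in support), total)

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, total),
        "recall": weighted(recall),
        "precision": weighted(precision),
        "f_measure": weighted(f1),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts, not taken from the paper.
metrics = weighted_binary_metrics(tp=50, fp=2, fn=3, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC uses all four cells of the confusion matrix, so it can fall well below accuracy when the classes are imbalanced.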
Performance of DEGnext on biologically validated data (Q) of 8 testing or untrained datasets using transfer learning
| Dataset | Accuracy | Recall | Precision | F-measure | MCC |
|---|---|---|---|---|---|
| BLCA | 95.69 | 95.69 | 95.76 | 0.96 | 0.91 |
| CHOL | 98.26 | 98.26 | 98.49 | 0.98 | 0.94 |
| COAD | 84.21 | 84.21 | 88.44 | 0.84 | 0.72 |
| ESCA | 92.97 | 92.97 | 94.68 | 0.93 | 0.61 |
| HNSC | 98.44 | 98.44 | 98.49 | 0.98 | 0.96 |
| KICH | 98.75 | 98.75 | 98.79 | 0.99 | 0.97 |
| READ | 86.05 | 86.05 | 89.23 | 0.86 | 0.75 |
| STAD | 97.77 | 97.77 | 97.89 | 0.98 | 0.95 |
Fig. 3 ROC curves of 17 datasets for general learning. Comparison of ROC scores for general learning of bio-test data (T3) for all 17 datasets
Fig. 4 ROC curves of 8 untrained datasets for transfer learning. Comparison of ROC scores for transfer learning on 100% of bio data (Q) for all test datasets, namely BLCA, CHOL, COAD, ESCA, HNSC, KICH, READ, and STAD
Fig. 5 Robustness comparison for DEGnext. (A) Robustness of DEGnext to noisy data for bio-test data (T3) of all 17 datasets. (B) Comparison with other ML methods in terms of the mean accuracy for bio-test data (T3) of all 17 datasets in the presence of different levels of Gaussian noise
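A robustness test of this kind perturbs the test data with zero-mean Gaussian noise before prediction. The sketch below shows only the perturbation step and is not the authors' procedure: the function name, the toy expression matrix, and the noise levels are hypothetical, since the caption does not give the levels used.

```python
import random

def add_gaussian_noise(matrix, sigma, seed=0):
    """Return a copy of matrix with N(0, sigma^2) noise added to each value."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in matrix]

# Toy expression matrix: rows are genes, columns are samples (made-up values).
expression = [
    [5.1, 4.8, 5.3],
    [0.2, 0.1, 0.4],
]

for sigma in (0.0, 0.1, 0.5):  # hypothetical noise levels
    noisy = add_gaussian_noise(expression, sigma)
    max_shift = max(
        abs(a - b)
        for row_a, row_b in zip(noisy, expression)
        for a, b in zip(row_a, row_b)
    )
    print(f"sigma={sigma}: largest absolute change = {max_shift:.3f}")
```

Evaluating the trained model on each noisy copy and plotting accuracy against sigma gives a curve of the kind compared in panel A.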
Analysis of GO enrichment of predicted UR and DR genes for BRCA and UCEC datasets
| Dataset | GO ID/attribute | | |
|---|---|---|---|
| BRCA | Cellular component morphogenesis | 1.72E−06 | 8.39E−03 |
| | Cellular response to endogenous stimulus | 5.19E−06 | 8.39E−03 |
| | Cell adhesion | 5.32E−06 | 8.39E−03 |
| | Biological adhesion | 6.15E−06 | 8.39E−03 |
| | Cell morphogenesis | 9.77E−06 | 1.07E−02 |
| | Negative regulation of response to stimulus | 1.25E−05 | 1.14E−02 |
| | Negative regulation of intracellular signal transduction | 4.77E−05 | 3.72E−02 |
| UCEC | Reproductive process | 1.19E−05 | 2.61E−02 |
| | Reproduction | 1.24E−05 | 2.61E−02 |
| | Positive regulation of plasminogen activation | 3.18E−05 | 4.46E−02 |
Ten significant pathways mapped from predicted UR and DR genes of BRCA and UCEC
| Cancer | Ingenuity canonical pathways | Mapped predicted UR genes | Mapped predicted DR genes |
|---|---|---|---|
| BRCA | RhoGDI signaling | ARHGEF17, CDH18, CDH5, FNBP1, PPP1R12C, RDX, RHOQ | CREBBP, RHOB, CD44, CDH6, SRC, ESR1, RAC1 |
| | ILK signaling | CCND1, FNBP1, ITGB7, MYH6, RHOQ, VIM | CREBBP, MYH11, CREB3, RHOB, IRS2, ACTN2, RAC1 |
| | Glioblastoma multiforme signaling | CCND1, FNBP1, FZD7, ITPR1, PLCZ1, RHOQ | CDK6, CDKN1A, PLCH2, RHOB, SRC, RAC1 |
| | Leukocyte extravasation signaling | CDH5, PRKCG, RDX, VCAM1 | PRKCH, CLDN12, CD44, SRC, MMP27, CLDN2, RAP1GAP, ACTN2, RAC1 |
| | Wnt/ | CCND1, CDH5, FZD7, PIN1, TGFBR1, TLE4 | CREBBP, CD44, SRC, CSNK1D, DVL3, POU5F1 |
| | Cholecystokinin/gastrin-mediated signaling | FNBP1, ITPR1, PRKCG, RHOQ | PRKCH, RHOB, SRC, CCKBR, RAC1 |
| | Factors promoting cardiogenesis in vertebrates | CCND1, FZD7, MYH6, PLCZ1, PRKCG, TGFBR1 | CREBBP, CREB3, PRKCH, PLCH2 |
| | Wnt/Ca+ pathway | FZD7, PLCZ1 | CREBBP, CREB3, PLCH2, DVL3 |
| | Dopamine-DARPP32 feedback in cAMP signaling | GRIN2D, ITPR1, PLCZ1, PRKCG | CREBBP, CREB3, PRKCH, PLCH2, CSNK1D, CACNA1S |
| | UVC-induced MAPK signaling | PRKCG, SMPD1 | PRKCH, ARAF, SRC |
| UCEC | PTEN signaling | ITGA4, MCRS1, SOS1 | INPP5K, CBL |
| | Ephrin receptor signaling | ATF4, ITGA4, SOS1 | CREBBP, EPHA6 |
| | Integrin signaling | ACTN1, CAPN2, ITGA4, SOS1 | ZYX, ITGA2B, ITGB8 |
| | ERK/MAPK signaling | ATF4, DUSP9, ITGA4, KSR1, NFATC1, SOS1 | CREBBP |
| | PPAR signaling | PPARA, SOS1 | CREBBP, TNFRSF11B, NCOR2 |
| | FLT3 signaling in hematopoietic progenitor cells | ATF4, SOS1 | CREBBP, CBL |
| | Calcium signaling | ATF4, ATP2B1, MYH10, NFATC1 | CREBBP, CACNA1C |
| | ILK signaling | ACTN1, ATF4, MYH10 | CREBBP, ITGB8 |
| | B Cell receptor signaling | ATF4, NFATC1, SOS1 | CREBBP, INPP5K |
| | IL-6 signaling | ABCB1, SOS1 | TNFRSF11B, CYP19A1 |
Number of genes and samples from preprocessed and filtered gene expression data used in labeling, training, fine-tuning, and testing
| Dataset | Filtered genes (FG) | Significant labeled DEGs (SDEGs) | Bio genes (Q) | Non-bio train data (T1) | Non-bio test data (T2) | Fine-tune data (F1) | Bio-test data (T3) |
|---|---|---|---|---|---|---|---|
| BRCA | 6514 | 4939 | 2327 | 3349 | 838 | 1861 | 466 |
| BLCA | 6514 | 2496 | 254 | 5008 | 1252 | 203 | 51 |
| CHOL | 6514 | 2811 | 552 | 4768 | 1193 | 441 | 111 |
| COAD | 6514 | 4213 | 1399 | 4092 | 1023 | 1119 | 280 |
| ESCA | 6514 | 1420 | 193 | 5056 | 1265 | 154 | 39 |
| HNSC | 6514 | 3860 | 734 | 4624 | 1156 | 587 | 147 |
| KICH | 6514 | 3422 | 306 | 4966 | 1242 | 244 | 62 |
| KIRC | 6514 | 4822 | 455 | 4847 | 1212 | 364 | 91 |
| KIRP | 6514 | 3535 | 337 | 4941 | 1236 | 269 | 68 |
| LIHC | 6514 | 4372 | 1498 | 4012 | 1004 | 1198 | 300 |
| LUAD | 6514 | 4387 | 566 | 4758 | 1190 | 452 | 114 |
| LUSC | 6514 | 4833 | 839 | 4540 | 1135 | 671 | 168 |
| PRAD | 6514 | 3803 | 1080 | 4347 | 1087 | 864 | 216 |
| READ | 6514 | 2678 | 121 | 5114 | 1279 | 96 | 25 |
| STAD | 6514 | 3379 | 388 | 4900 | 1226 | 310 | 78 |
| THCA | 6514 | 4292 | 3031 | 2786 | 697 | 2424 | 607 |
| UCEC | 6514 | 3992 | 999 | 4412 | 1103 | 799 | 200 |
Q: bio data; T1: non-bio train data; T2: non-bio test data; F1: fine tune data; T3: bio test data
Values of hyperparameters used in DEGnext model
| Hyperparameters | First level training | Fine-tuning |
|---|---|---|
| Epoch | 50 | 31 |
| Loss function | CrossEntropyLoss() | BCEWithLogitsLoss() |
| Learning-rate | 1e−4 | 1e−4 |
| Betas | (0.9, 0.999) | (0.9, 0.999) |
| eps | 1e−08 | 1e−08 |
| Weight-decay | 0 | 0 |
| Batch-size | 256 | 64 |