Marwa Helmy1, Eman Eldaydamony1, Nagham Mekky1, Mohammed Elmogy2, Hassan Soliman1.
Abstract
Identifying genes related to Parkinson's disease (PD) is an active research topic in biomedical analysis and plays a critical role in diagnosis and treatment. Recently, many studies have proposed techniques for predicting disease-related genes. However, few of these techniques are designed or developed for PD gene prediction. Most of the PD techniques identify only protein genes and discard long noncoding RNA (lncRNA) genes, which play an essential role in biological processes and in the transformation and development of diseases. This paper proposes a novel prediction system to identify protein and lncRNA genes related to PD that can aid early diagnosis. First, we preprocessed the genes into DNA FASTA sequences from the University of California Santa Cruz (UCSC) genome browser and removed redundancies. Second, we extracted significant features from the DNA FASTA sequences using the PyFeat method with AdaBoost for feature selection. The selected features achieved promising results compared with features extracted by several state-of-the-art feature extraction techniques. Finally, the features were fed to a gradient-boosted decision tree (GBDT) to classify the tested cases. Seven performance metrics were used to evaluate the proposed system. It achieved an average accuracy of 78.6%, an area under the curve (AUC) of 84.5%, an area under the precision-recall curve (AUPR) of 85.3%, an F1-score of 78.3%, a Matthews correlation coefficient (MCC) of 0.575, a sensitivity (SEN) of 77.1%, and a specificity (SPC) of 80.2%. The experiments demonstrate promising results compared with other systems. The predicted top-ranked protein and lncRNA genes were verified against the literature.
Year: 2022 PMID: 35705654 PMCID: PMC9200794 DOI: 10.1038/s41598-022-14127-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Abbreviations used.
| Abbreviation | Definition | Abbreviation | Definition |
|---|---|---|---|
| PD | Parkinson’s disease | ACC | Accuracy |
| lncRNA | Long noncoding RNA | PPV | Positive predictive value |
| DFT | Discrete Fourier Transform | FFT | Fast Fourier Transform |
| MM | monoMonoKGap | MD | monoDiKGap |
| MT | monoTriKGap | DM | diMonoKGap |
| DD | diDiKGap | DT | diTriKGap |
| TM | triMonoKGap | TD | triDiKGap |
| A | Adenine | C | Cytosine |
| G | Guanine | T | Thymine |
| DT | Decision Tree | NB | Naive Bayes |
| TP | True positive | RF | Random Forest |
| FP | False positive | AB | AdaBoost |
| LR | Logistic Regression | GBDT | Gradient-boosted decision tree |
| SVM | Support Vector Machine | LDA | Linear Discriminant Analysis |
| AUPR | Area under the precision-recall curve | AUC | Area under the curve |
| FN | False negative | TN | True negative |
| SEN | Sensitivity | SPC | Specificity |
| TPR | True positive rate | FPR | False positive rate |
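The confusion-matrix abbreviations above (TP, FP, TN, FN) combine into the reported metrics through their standard definitions. The following is a minimal sketch of those definitions; the function name `binary_metrics` and the example counts are illustrative assumptions, not values taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur and
    both are predicted at least once.
    """
    sen = tp / (tp + fn)                    # sensitivity, also the TPR
    spc = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                    # positive predictive value
    acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    f1 = 2 * ppv * sen / (ppv + sen)        # harmonic mean of PPV and SEN
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "ACC": acc, "SEN": sen, "SPC": spc, "PPV": ppv,
        "FPR": fp / (fp + tn),              # false positive rate = 1 - SPC
        "F1": f1, "MCC": mcc,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Unlike the other metrics, MCC ranges from −1 to 1, which is why the tables report it as a plain number rather than a percentage.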
A comparison of some recent studies.
| Study | Year | Analysis | Methodology | Dataset |
|---|---|---|---|---|
| Radivojac et al.[ | 2008 | Identifying genes related to disease based on PPI network | PPI, SVM | HPRD, Swiss-Prot |
| Zhang et al.[ | 2011 | Predicting genes related to Parkinson’s disease based on gene expression | PCC, TOPPGene | NCBI GEO |
| Yang et al.[ | 2014 | Predicting disease-genes based on PPI, GO, and gene expression similarity | EPUI | HPRD, OPHID |
| Peng et al.[ | 2017 | Predicting disease-related genes based on genes, diseases, and ontology | SLN-SRW | Clinvar, GO, DO, STRING, OMIM |
| Hwang[ | 2017 | Identifying genes related to disease based on random forests | SRF | OMIM, HPRD, OPHID, GO |
| Tian et al.[ | 2017 | Predicting genes related to disease based on an integrated gene similarity network | RWRB, SNF | Swiss-Prot, MimMiner, OMIM, GO, GOA, Pfam |
| Ding et al.[ | 2018 | Predicting lncRNA genes related to diseases | TPGLDA | LncRNADisease, DisGeNET |
| Peng et al.[ | 2019 | Parkinson’s disease gene prediction based on protein genes | N2A-SVM | ClinVar |
| Lei et al.[ | 2019 | Predicting disease-related genes based on proteins, lncRNAs, and diseases | InLPCH | LncRNADisease, HPRD, OMIM |
| Xuan et al.[ | 2019 | Predicting disease-related lncRNA genes | CNNLDA | LncRNADisease, Lnc2Cancer, GeneRIF, starBase, DincRNA |
| Zhang et al.[ | 2019 | Predicting lncRNAs related to disease based on lncRNAs, microRNAs, and diseases | DeepWalk, Rule-based inference | Lnc2Cancer, HMDD, miR2Disease, miRCancer, lncRNADisease |
| Yang et al.[ | 2020 | Predicting disease-related genes based on disease-gene, gene-GO, and disease-phenotype | PDGNet | DisGeNet, HPO, OrphaNet, STRING, HPRD, IntAct, PINA |
| Bonidia et al.[ | 2020 | Discriminating between different cases of lncRNAs | DFT, Entropy, Complex Network | RefSeq, GreeNC, Ensembl (v87, v32) |
| Wang et al.[ | 2021 | Identifying lncRNAs related to diseases based on lncRNA, miRNA, and disease | LFMP | MNDRv2.0, MNDRv2.0, Starbase v2.0 |
| Joodaki et al.[ | 2021 | Identifying genes related to disease based on similarity network | RWRHN-FF | DisGeNet, OMIM, KEGG, UniProt, GO, Pfam, COXPRESdb |
| Bi et al.[ | 2021 | Predicting PD-related genes and brain regions | CERNNE | PPMI |
Figure 1. The proposed prediction system for identifying protein and lncRNA genes associated with PD.
PyFeat feature generation and their numbers.
| Method | Number of features |
|---|---|
| Z-curve | 3 |
| GCcontent | 1 |
| ATGC ratio | 1 |
| Cumulative Skew | 2 |
| Pseudo composition | 84 |
| monoMonoKGap | 80 |
| monoDiKGap | 320 |
| monoTriKGap | 1280 |
| diMonoKGap | 320 |
| diDiKGap | 1280 |
| diTriKGap | 5120 |
| triMonoKGap | 1280 |
| triDiKGap | 5120 |
| Total number of features | 14,891 |
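The k-gap counts in this table are consistent with 4^a × 4^b × k combinations for a left block of a nucleotides, a right block of b nucleotides, and k gap sizes, if a maximum gap of 5 is assumed. Likewise, 84 = 4 + 16 + 64 matches k-mer composition up to length 3. Both parameter values (`K_GAP = 5`, `MAX_KMER = 3`) are inferred from the counts and are assumptions rather than settings stated here. A short sketch that reproduces the arithmetic:

```python
NUCLEOTIDES = 4   # A, C, G, T
K_GAP = 5         # assumption: inferred from 80 = 4 * 4 * 5
MAX_KMER = 3      # assumption: inferred from 84 = 4 + 16 + 64


def kgap_count(left, right, k_gap=K_GAP):
    """Features for a (left-mer, gap, right-mer) pattern over all gap sizes."""
    return (NUCLEOTIDES ** left) * (NUCLEOTIDES ** right) * k_gap


counts = {
    "Z-curve": 3,
    "GCcontent": 1,
    "ATGC ratio": 1,
    "Cumulative Skew": 2,
    "Pseudo composition": sum(NUCLEOTIDES ** k for k in range(1, MAX_KMER + 1)),
    "monoMonoKGap": kgap_count(1, 1),
    "monoDiKGap": kgap_count(1, 2),
    "monoTriKGap": kgap_count(1, 3),
    "diMonoKGap": kgap_count(2, 1),
    "diDiKGap": kgap_count(2, 2),
    "diTriKGap": kgap_count(2, 3),
    "triMonoKGap": kgap_count(3, 1),
    "triDiKGap": kgap_count(3, 2),
}
total_features = sum(counts.values())
```

Under these assumptions the per-method counts and the total of 14,891 match the table row for row.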
Definition of important variables, parameters, and symbols of formulas.
| S | Biological sequence of element values: A, C, G, T | L | Length of a sequence |
| Index of an element in a sequence for time domain | Value of an element at index l in time domain | ||
| Index of an element in a sequence for frequency domain | Value of an element at index f in frequency domain | ||
| The binary matrix with size (4*L) for [b1, b2, b3, b4] | Binary sequence for presenting A element | ||
| Binary sequence for presenting C element | Binary sequence for presenting G element | ||
| Binary sequence for presenting T element | Binary value of an element at index l in time domain | ||
| Frequency value of an element at index f for binary sequence | Power spectrum for B[f] | ||
| Integer representation sequence | Integer value of an element at index l in time domain | ||
| Frequency value of an element at index f for integer sequence | Power spectrum for I[f] | ||
| Real representation sequence | Real value of an element at index l in time domain | ||
| Frequency element’s value at index f for real sequence | Power spectrum for r[f] | ||
| Element’s value for the x-coordinate of the Z-curve at index l in time domain | Element’s value for the y-coordinate of the Z-curve at index l in time domain | ||
| Element’s value for the z-coordinate of the Z-curve at index l in time domain | The Z-curve element’s value of x[l], y[l], and z[l] at index l in time domain | ||
| Frequency value of an element at index f for the x-coordinate of the Z-curve | Frequency value of an element at index f for the y-coordinate of the Z-curve | ||
| Frequency value of an element at index f for the z-coordinate of the Z-curve | Power spectrum for x[l], y[l], and z[l] | ||
| EIIP representation sequence | EIIP value of an element at index l in time domain | ||
| Frequency value of an element at index f for EIIP sequence | Power spectrum for D[f] | ||
| Length of the longest subsequence for pseudo composition method | Number of features extracted based on pseudo composition method | ||
| Number of gaps between nucleotides | Length of the longest KGap | ||
| Number of features extracted based on the monoMonoKGap method | Number of features extracted based on the monoDiKGap method | ||
| Number of features extracted based on the monoTriKGap method | Number of features extracted based on the diMonoKGap method | ||
| Number of features extracted based on the diDiKGap method | Number of features extracted based on the diTriKGap method | ||
| Number of features extracted based on the triMonoKGap method | Number of features extracted based on the triDiKGap method | ||
| Number of samples in dataset | The id of sample in dataset | ||
| The sample with id | The real label for sample | ||
| The initialized predicted value for all samples x, namely, initialized model | The id of the round or tree | ||
| Loss function between a predicted value | Number of rounds in training | ||
| Pseudo residuals or negative gradient of the loss function for | Number of terminal nodes at | ||
| Tree leaf node or terminal region with one or multiple | Final model at the end of training | ||
| Optimal output value of fitting the leaf node for samples in each leaf node | Learning rate with |
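Several of the symbols above describe mapping a sequence S of length L to a numerical signal and taking the power spectrum of its discrete Fourier transform. The sketch below illustrates this for the binary indicator and the Z-curve, using the commonly used cumulative Z-curve definition, which may differ in detail from the authors' formulation. All function names are hypothetical, the DFT is a naive O(L²) version for clarity, and summing the three coordinate spectra is one possible choice, shown as an assumption.

```python
import cmath


def binary_indicator(seq, base):
    """Binary sequence: 1 where `base` occurs in seq, else 0."""
    return [1 if s == base else 0 for s in seq.upper()]


def z_curve(seq):
    """Cumulative Z-curve coordinates x[l], y[l], z[l]."""
    x = y = z = 0
    xs, ys, zs = [], [], []
    for s in seq.upper():
        if s not in "ACGT":
            raise ValueError(f"unexpected symbol {s!r}")
        x += 1 if s in "AG" else -1   # purine vs pyrimidine
        y += 1 if s in "AC" else -1   # amino vs keto
        z += 1 if s in "AT" else -1   # weak vs strong hydrogen bond
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def dft(signal):
    """Naive discrete Fourier transform of a real-valued signal."""
    n = len(signal)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * f * l / n) for l, v in enumerate(signal))
        for f in range(n)
    ]


def power_spectrum(signal):
    """Squared magnitude of each DFT coefficient."""
    return [abs(c) ** 2 for c in dft(signal)]


seq = "ACGT"  # toy sequence, for illustration only
xs, ys, zs = z_curve(seq)
ps_a = power_spectrum(binary_indicator(seq, "A"))
# Assumption: combine the three coordinate spectra by summation.
ps_z = [px + py + pz for px, py, pz in
        zip(power_spectrum(xs), power_spectrum(ys), power_spectrum(zs))]
```

For real sequences an FFT would replace the naive transform; the quadratic version is used here only to keep the correspondence with the symbol definitions visible.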
Datasets description.
| Datasets | Site | Positive | Negative |
|---|---|---|---|
| Protein | ClinVar | 182 | 185 |
| LncRNA | LncRNADisease v2.0 | 137 | 141 |
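The evaluations on these datasets use 4-fold and 10-fold cross-validation. As a rough illustration of what a class-balanced 10-fold split of the protein dataset (182 positive, 185 negative) looks like, the sketch below deals samples of each class into folds round-robin. The helper name `stratified_folds`, the round-robin scheme, and the seed are assumptions; the source does not state whether stratification was used.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled members of this class out round-robin.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Protein dataset sizes from the table: 182 positive, 185 negative.
labels = [1] * 182 + [0] * 185
folds = stratified_folds(labels, k=10)
fold_sizes = [len(fold) for fold in folds]
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold then serves once as the test set while the remaining folds form the training set, and the reported metrics are averaged over the folds.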
The performance evaluation of the proposed features based on PyFeat with AB compared with other techniques: five numerical representations, RFF, Pse-in-One2.0, iLearn, and SubFeat with 10-fold cross-validation based on the protein dataset. Significant values are in bold.
| Method | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) |
|---|---|---|---|---|---|---|---|
| Binary | 53.0 | 57.6 | 58.4 | 53.1 | 0.061 | 54.1 | 51.9 |
| Integer | 61.8 | 64.7 | 66.9 | 60.6 | 0.241 | 60.1 | 63.4 |
| Real | 57.1 | 62.2 | 64.0 | 57.5 | 0.142 | 58.5 | 55.7 |
| Z-curve | 59.0 | 60.9 | 64.7 | 58.1 | 0.182 | 57.4 | 60.7 |
| EIIP | 58.4 | 62.3 | 65.5 | 57.7 | 0.171 | 57.4 | 59.6 |
| RFF | 63.4 | 66.4 | 66.6 | 64.9 | 0.271 | 68.3 | 58.5 |
| Pse-in-One2.0 | 62.3 | 65.3 | 65.3 | 62.9 | 0.246 | 63.7 | 60.9 |
| iLearn | 60.7 | 62.9 | 64.1 | 59.9 | 0.215 | 59.3 | 62.0 |
| SubFeat | 59.6 | 63.0 | 67.0 | 58.9 | 0.195 | 57.4 | 61.7 |
| Proposed features | | | | | | | |
Figure 2. The performance evaluation of the features based on PyFeat with AB compared with other techniques: five numerical representations, RFF, Pse-in-One2.0, iLearn, and SubFeat based on the protein dataset.
The performance evaluation of the proposed features based on PyFeat with AB compared with other techniques: five numerical representations, RFF, Pse-in-One2.0, iLearn, and SubFeat with 10-fold cross-validation based on the lncRNA dataset. Significant values are in bold.
| Method | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) |
|---|---|---|---|---|---|---|---|
| Binary | 60.4 | 64.9 | 66.2 | 60.1 | 0.209 | 60.1 | 60.7 |
| Integer | 61.8 | 64.4 | 67.9 | 59.6 | 0.237 | 57.9 | 65.6 |
| Real | 59.3 | 61.4 | 65.3 | 59.3 | 0.190 | 60.1 | 58.5 |
| Z-curve | 60.4 | 63.8 | 66.8 | 59.2 | 0.214 | 58.5 | 62.3 |
| EIIP | 60.1 | 63.1 | 66.7 | 60.2 | 0.206 | 60.7 | 59.6 |
| RFF | 67.5 | 67.4 | 64.0 | 66.5 | 0.355 | 65.4 | 69.2 |
| Pse-in-One2.0 | 65.5 | 64.2 | 63.7 | 65.6 | 0.316 | 66.5 | 64.7 |
| iLearn | 59.4 | 64.8 | 67.5 | 61.2 | 0.193 | 64.4 | 54.2 |
| SubFeat | 63.6 | 66.7 | 67.7 | 63.5 | 0.278 | 64.5 | 62.6 |
| Proposed features | | | | | | | |
Figure 3. The performance evaluation of the features based on PyFeat with AB compared with other techniques: five numerical representations, RFF, Pse-in-One2.0, iLearn, and SubFeat based on the lncRNA dataset.
The performance evaluation of the proposed system based on the GBDT compared with state-of-the-art classifiers using 4-fold and 10-fold cross-validation techniques on the protein dataset. Significant values are in bold.
| Classifier | K-fold | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) |
|---|---|---|---|---|---|---|---|---|
| LR | 4 | 65.7 | 71.7 | 68.7 | 64.5 | 0.316 | 63.5 | 67.9 |
| | 10 | 66.8 | 72.1 | 70.9 | 65.8 | 0.340 | 65.2 | 68.5 |
| DT | 4 | 62.4 | 62.5 | 57.7 | 61.8 | 0.250 | 61.3 | 63.6 |
| | 10 | 61.4 | 61.4 | 56.9 | 61.9 | 0.230 | 63.5 | 59.2 |
| NB | 4 | 48.5 | 47.6 | 47.4 | 17.8 | −0.097 | 22.1 | 74.5 |
| | 10 | 46.6 | 47.2 | 49.7 | 6.93 | −0.139 | 8.8 | 83.7 |
| Bagging | 4 | 66.6 | 72.3 | 69.4 | 61.8 | 0.340 | 54.7 | 78.3 |
| | 10 | 68.8 | 75.3 | 73.4 | 66.4 | 0.381 | 63.0 | 74.5 |
| RF | 4 | 75.3 | 83.8 | 83.4 | 74.2 | 0.508 | 71.8 | |
| | 10 | 77.2 | | | 75.3 | 0.554 | 70.7 | |
| AB | 4 | 72.3 | 80.7 | 78.2 | 72.6 | 0.449 | 74.0 | 70.7 |
| | 10 | 74.2 | 81.7 | 82.5 | 73.5 | 0.501 | 71.8 | 76.3 |
| SVM | 4 | 68.8 | 74.9 | 74.0 | 67.0 | 0.378 | 64.1 | 73.4 |
| | 10 | 68.8 | 75.4 | 75.6 | 67.8 | 0.378 | 66.9 | 70.7 |
| LDA | 4 | 59.5 | 60.5 | 58.2 | 58.6 | 0.190 | 57.5 | 61.4 |
| | 10 | 60.2 | 62.0 | 60.2 | 60.7 | 0.207 | 63.0 | 57.6 |
| GBDT | 4 | | | | | | | 75.0 |
| | 10 | | | 86.0 | | | | 82.1 |
The performance evaluation of the proposed system based on the GBDT compared with state-of-the-art classifiers using 4-fold and 10-fold cross-validation techniques based on the lncRNA dataset. Significant values are in bold.
| Classifier | K-fold | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) |
|---|---|---|---|---|---|---|---|---|
| LR | 4 | 62.5 | 70.3 | 70.8 | 62.0 | 0.254 | 62.3 | 62.6 |
| | 10 | 66.0 | 71.6 | 71.1 | 65.1 | 0.322 | 64.1 | 67.9 |
| DT | 4 | 62.0 | 61.9 | 57.3 | 60.5 | 0.246 | 61.3 | 62.6 |
| | 10 | 57.0 | 57.0 | 54.3 | 57.1 | 0.141 | 59.7 | 54.3 |
| NB | 4 | 54.9 | 55.0 | 52.6 | 66.9 | 0.164 | 91.5 | 18.7 |
| | 10 | 48.2 | 47.5 | 51.7 | 8.6 | −0.075 | 10.0 | 85.9 |
| Bagging | 4 | 71.4 | 75.7 | 73.2 | 68.7 | 0.433 | 64.2 | 78.5 |
| | 10 | 65.2 | 72.0 | 71.0 | 62.8 | 0.310 | 59.1 | 71.2 |
| RF | 4 | 71.8 | 80.0 | 82.6 | 70.2 | 0.442 | 68.9 | 74.8 |
| | 10 | 74.2 | 82.0 | 83.0 | 73.0 | 0.487 | 70.7 | 77.7 |
| AB | 4 | 73.7 | 79.4 | 75.4 | 74.1 | 0.478 | 75.5 | 72.0 |
| | 10 | 72.3 | 79.9 | 79.8 | 71.6 | 0.541 | 70.2 | 74.5 |
| SVM | 4 | 70.5 | 77.0 | 74.9 | 71.4 | 0.416 | 75.5 | 65.4 |
| | 10 | 67.67 | 74.5 | 74.0 | 66.4 | 0.355 | 65.2 | 70.1 |
| LDA | 4 | 55.9 | 60.2 | 57.5 | 55.4 | 0.120 | 56.6 | 55.1 |
| | 10 | 61.1 | 62.3 | 58.9 | 60.5 | 0.226 | 61.9 | 60.3 |
| GBDT | 4 | | | | | | | |
| | 10 | | | | | | | |
Figure 4. The AUC for the proposed system based on the GBDT compared with state-of-the-art classifiers on the protein dataset with (a) 4-fold and (b) 10-fold cross-validation techniques.
Figure 5. The accuracy box for the proposed system based on the GBDT compared with state-of-the-art classifiers on the protein dataset with (a) 4-fold and (b) 10-fold cross-validation techniques.
Figure 6. The accuracy box for the proposed system based on the GBDT compared with state-of-the-art classifiers on the lncRNA dataset with (a) 4-fold and (b) 10-fold cross-validation techniques.
Figure 7. The AUC for the proposed system based on the GBDT compared with state-of-the-art classifiers on the lncRNA dataset with (a) 4-fold and (b) 10-fold cross-validation techniques.
Performance, classification methods, and feature selection methods of state-of-the-art systems compared with the proposed system on the protein dataset. Significant values are in bold.
| System | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) | Classification method | Feature selection method |
|---|---|---|---|---|---|---|---|---|---|
| Bonidia et al.[ | 66.3 | 71.8 | 73.5 | 67.0 | 0.331 | 68.9 | 63.6 | RF | None |
| Nosrati et al.[ | 60.9 | 66.0 | 65.4 | 58.8 | 0.219 | 58.2 | 63.6 | RF | None |
| Sun et al.[ | 63.1 | 68.0 | 68.7 | 61.0 | 0.266 | 57.7 | 68.5 | SVM | F-score, Greedy Algorithm |
| Haque et al.[ | 58.2 | 63.6 | 62.1 | 55.4 | 0.166 | 53.0 | 63.4 | SVM, SVM, SVM | None |
| Proposed system | 79.4 | 84.9 | 86.0 | 78.7 | 0.590 | 76.8 | 82.1 | GBDT | AB |
Performance, classification methods, and feature selection methods of state-of-the-art systems compared with the proposed system on the lncRNA dataset. Significant values are in bold.
| System | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) | Classification method | Feature selection method |
|---|---|---|---|---|---|---|---|---|---|
| Bonidia et al.[ | 60.2 | 61.4 | 66.0 | 59.3 | 0.210 | 58.9 | 61.7 | RF | None |
| Nosrati et al.[ | 63.1 | 65.5 | 65.5 | 63.9 | 0.266 | 66.5 | 59.9 | RF | None |
| Sun et al.[ | 64.6 | 68.9 | 69.5 | 63.6 | 0.299 | 64.5 | 64.5 | SVM | F-score, Greedy Algorithm |
| Haque et al.[ | 56.7 | 61.6 | 63.2 | 53.5 | 0.418 | 51.4 | 61.7 | SVM, SVM, SVM | None |
| Proposed system | 77.8 | 84.1 | 84.5 | 77.4 | 0.560 | 77.3 | 78.3 | GBDT | AB |
Figure 8. Comparison of the proposed system with the state-of-the-art systems on the protein dataset. (a) Area under the ROC curve. (b) Performance evaluation.
Figure 9. Comparison of the proposed system with the state-of-the-art systems on the lncRNA dataset. (a) Area under the ROC curve. (b) Performance evaluation.
Average performance of the proposed prediction system based on protein and lncRNA datasets.
| Datasets | ACC (%) | AUC (%) | AUPR (%) | F1-score (%) | MCC | SEN (%) | SPC (%) |
|---|---|---|---|---|---|---|---|
| Proteins | 79.4 | 84.9 | 86.0 | 78.7 | 0.590 | 76.8 | 82.1 |
| LncRNAs | 77.8 | 84.1 | 84.5 | 77.4 | 0.560 | 77.3 | 78.3 |
| Average | 78.6 | 84.5 | 85.3 | 78.1 | 0.575 | 77.1 | 80.2 |
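The Average row can be reproduced from the two dataset rows as a simple arithmetic mean, provided ties are rounded half up (for example, 85.25 becomes 85.3). The sketch below uses Python's `decimal` module for that reason, since binary floating point with the built-in `round` would not round such ties consistently. The helper name `mean_half_up` is a hypothetical choice, and the rounding rule is an assumption inferred from the table.

```python
from decimal import Decimal, ROUND_HALF_UP

METRICS = ["ACC", "AUC", "AUPR", "F1", "MCC", "SEN", "SPC"]

# Values copied from the Proteins and LncRNAs rows of the table.
proteins = dict(zip(METRICS, ["79.4", "84.9", "86.0", "78.7", "0.590", "76.8", "82.1"]))
lncrnas = dict(zip(METRICS, ["77.8", "84.1", "84.5", "77.4", "0.560", "77.3", "78.3"]))


def mean_half_up(a, b, step):
    """Mean of two decimal strings, rounded half up to the given step."""
    mean = (Decimal(a) + Decimal(b)) / 2
    return mean.quantize(Decimal(step), rounding=ROUND_HALF_UP)


average = {
    m: mean_half_up(proteins[m], lncrnas[m], "0.001" if m == "MCC" else "0.1")
    for m in METRICS
}
```

With these inputs, every entry of `average` agrees with the Average row: 78.6, 84.5, 85.3, 78.1, 0.575, 77.1, and 80.2.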
Figure 10. The average performance evaluation of the proposed prediction system based on protein and lncRNA datasets.
The comparison between our proposed system and some current systems based on AUC. Significant value is in bold.
| | Peng et al.[ | Lei et al.[ | Peng et al.[ | The proposed system |
|---|---|---|---|---|
| AUC (%) | 72.9 | 78.6 | 79.0 | |
Figure 11. The comparison between our proposed system and some current studies based on AUC.