Yan Wang1, Sen Yang1, Jing Zhao2,3, Wei Du1, Yanchun Liang1,4, Cankun Wang5, Fengfeng Zhou1, Yuan Tian6,7, Qin Ma8,9.
Abstract
Measuring conditional relatedness between a pair of genes is a fundamental technique and still a significant challenge in computational biology. Such relatedness can be assessed by gene expression similarities, but these suffer from high false discovery rates. Meanwhile, other types of features, e.g., prior-knowledge based similarities, are only viable for measuring global relatedness. In this paper, we propose a novel machine learning model, named Multi-Features Relatedness (MFR), for accurately measuring conditional relatedness between a pair of genes by incorporating expression similarities with prior-knowledge based similarities in an assessment criterion. MFR is used to predict gene-gene interactions extracted from the COXPRESdb, KEGG, HPRD, and TRRUST databases, evaluated by 10-fold cross-validation and test verification, and to identify gene-gene interactions collected from the GeneFriends and DIP databases for further verification. The results show that MFR achieves the highest area under the curve (AUC) values for identifying gene-gene interactions in the development, test, and DIP datasets. Specifically, it obtains an average improvement of 1.1% in precision for detecting gene pairs with both high expression similarities and high prior-knowledge based similarities in all datasets, compared with other linear models and coexpression analysis methods. For cancer gene network construction and gene function prediction, MFR also yields results with greater biological significance and higher average prediction accuracy than the other compared models and methods. A website for the MFR model and the relevant datasets can be accessed at http://bmbl.sdstate.edu/MFR.
Year: 2019 PMID: 30862804 PMCID: PMC6414665 DOI: 10.1038/s41598-019-40780-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Workflow of the MFR model. The workflow has five steps: (i) gene pair sample collection, (ii) gene feature extraction, (iii) gene pair feature calculation, (iv) SVM model construction, and (v) verification and discussion.
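The verification step relies, per the abstract, on 10-fold cross-validation. The following is a minimal sketch of how such a split could be produced over gene pair samples. The function name `k_fold_indices`, the seed, and the toy sample count are assumptions for illustration and are not taken from the paper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for k roughly equal folds.

    Hypothetical helper: the paper does not describe how folds were drawn.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Toy example: 25 gene pair samples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one validation fold, so every gene pair is scored once by a model that did not see it during training.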
Structure of MFR dataset.
| Sub dataset | Sub-sub dataset | Resource database | Positive gene pairs | Negative gene pairs |
|---|---|---|---|---|
| Coexpression | | The COXPRESdb[ | 30,353 | 29,607 |
| Prior-knowledge based | KEGG | The KEGG[ | 13,386 | 13,386 |
| Prior-knowledge based | PPI | Ref.[ | 18,227 | 26,533 |
| Prior-knowledge based | TRRUST | The TRRUST[ | 5,034 | 5,034 |
Sample size of RNA-seq data for four cancer types.
| Type | Cancer (Samples) | Normal (Samples) |
|---|---|---|
| Bladder urothelial carcinoma (BLCA) | 408 | 19 |
| Breast invasive carcinoma (BRCA) | 1095 | 113 |
| Colon adenocarcinoma (COAD) | 285 | 41 |
| Lung adenocarcinoma (LUAD) | 515 | 59 |
Figure 2. Structure of the MFR model. The model is based on an SVM and takes 12 similarity-based gene pair features as input. Its output value, namely MFR, is used as an assessment criterion for measuring conditional relatedness between genes.
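To illustrate what a similarity-based gene pair feature vector might look like, here is a minimal sketch that computes two expression similarities and appends prior-knowledge similarities supplied by the caller. It assumes that the PCC and SRC abbreviations used later in this record denote the Pearson and Spearman correlation coefficients. The function names, the toy expression profiles, and the prior-knowledge value 0.4 are hypothetical, and the sketch does not reproduce the paper's actual 12 features.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pair_features(expr_a, expr_b, prior_similarities):
    """Concatenate expression-based and prior-knowledge based similarities."""
    return [pearson(expr_a, expr_b), spearman(expr_a, expr_b), *prior_similarities]


# Hypothetical expression profiles of two genes across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]
features = pair_features(gene_a, gene_b, [0.4])
```

A vector of this kind, one per gene pair, is the sort of input a classifier such as an SVM would be trained on.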
Figure 3. (A) ROCs of the nine models or methods for identifying gene-gene interactions by 10-fold cross-validation. (B) Average PPVs of the nine models or methods for detecting B0/B1 matched gene pairs by 10-fold cross-validation.
Figure 4. ROCs of the nine models or methods for identifying gene-gene interactions in the (A) test, (C) GeneFriends and (E) DIP datasets. Average PPVs of the nine models or methods for detecting B0/B1 matched gene pairs in the (B) test, (D) GeneFriends and (F) DIP datasets.
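The two metrics reported in these figures can be sketched as follows. This is a generic illustration of AUC (the probability that a positive gene pair is scored above a negative one) and of PPV at a score threshold. The B0/B1 matching used in the paper is not reproduced here, and the scores, labels, and threshold are made-up values.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ppv(scores, labels, threshold):
    """Positive predictive value: true positives among predicted positives."""
    predicted = [label for s, label in zip(scores, labels) if s >= threshold]
    if not predicted:
        return 0.0
    return sum(predicted) / len(predicted)


# Hypothetical relatedness scores for six gene pairs (1 = known interaction).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
example_auc = auc(scores, labels)       # 8 of 9 positive/negative pairs ordered correctly
example_ppv = ppv(scores, labels, 0.5)  # 2 of the 3 pairs scored >= 0.5 are positive
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for illustration but would be replaced by a rank-based formula for datasets of the size listed in this record.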
Gene markers for glutamine and glutamate metabolism.
| Gene | Description | GO term |
|---|---|---|
| ASNS | Asparagine Synthetase | asparagine biosynthetic process |
| ALDH18A1 | Aldehyde Dehydrogenase 18 Family Member A1 | proline biosynthetic process |
| CAD | Carbamoyl-Phosphate Synthetase 2, Aspartate Transcarbamylase, And Dihydroorotase | ‘ |
| CS | Citrate Synthase | tricarboxylic acid cycle |
| CTPS | CTP Synthase 1 | ‘ |
| CTPS2 | CTP Synthase 2 | ‘ |
| DLD | Dihydrolipoamide Dehydrogenase | 2-oxoglutarate metabolic process |
| DLST | Dihydrolipoamide S-Succinyltransferase | tricarboxylic acid cycle |
| GFPT1 | Glutamine-Fructose-6-Phosphate Transaminase 1 | UDP-N-acetylglucosamine biosynthetic process |
| GFPT2 | Glutamine-Fructose-6-Phosphate Transaminase 2 | UDP-N-acetylglucosamine biosynthetic process |
| GLUL | Glutamate-Ammonia Ligase | glutamine biosynthetic process |
| GLS | Glutaminase | glutamate biosynthetic process |
| GLS2 | Glutaminase 2 | glutamate biosynthetic process |
| OGDH | Oxoglutarate Dehydrogenase | tricarboxylic acid cycle |
| OGDHL | Oxoglutarate Dehydrogenase-like | tricarboxylic acid cycle |
| PFAS | Phosphoribosylformylglycinamidine Synthase | ‘ |
| PPAT | Phosphoribosyl Pyrophosphate Amidotransferase | ‘ |
| PSAT1 | Phosphoserine Aminotransferase 1 | ‘ |
| GCLC | Glutamate-Cysteine Ligase Catalytic Subunit | glutathione biosynthetic process |
| GCLM | Glutamate-Cysteine Ligase Modifier Subunit | glutathione biosynthetic process |
| GSS | Glutathione Synthetase | glutathione biosynthetic process |
Figure 5. Metabolic pathways predicted to be directly influenced by increased glutamine and glutamate metabolism in nine BRCA gene networks.
Figure 6. Number of metabolic pathways predicted to be directly influenced by increased glutamine and glutamate metabolism in four cancer types. These pathways were predicted in cancer gene networks, where nodes represent up-regulated metabolic genes and edges represent relatedness between genes, measured by the five linear models and six coexpression analysis methods.
Figure 7. Percentages of L0- and L1-matched selected genes in the nine KEGG metabolic gene networks. In these networks, nodes represent genes involved in KEGG metabolism pathways, and edges represent relatedness between genes, measured by the nine models or methods.
Performances of the nine models or methods for different applications.
| Application | Evaluation | MFR | Logit | LDA | PCC | SRC | MI | PPC | CMI | CXP |
|---|---|---|---|---|---|---|---|---|---|---|
| 10-fold cross-validation | AUC | | 0.818 | 0.818 | 0.699 | 0.692 | 0.664 | 0.695 | 0.484 | 0.686 |
| 10-fold cross-validation | B0 + B1 | | 0.916 | 0.916 | 0.495 | 0.469 | 0.456 | 0.366 | 0.152 | 0.455 |
| Test verification | AUC | | 0.822 | 0.822 | 0.696 | 0.690 | 0.658 | 0.691 | 0.484 | 0.682 |
| Test verification | B0 + B1 | | 0.856 | 0.856 | 0.440 | 0.518 | 0.428 | 0.270 | 0.172 | 0.477 |
| GeneFriends verification | AUC | 0.816 | | | 0.815 | 0.764 | 0.733 | 0.823 | 0.484 | 0.782 |
| GeneFriends verification | B0 + B1 | | 0.957 | 0.957 | 0.571 | 0.471 | 0.483 | 0.613 | 0.091 | 0.485 |
| DIP verification | AUC | | 0.724 | 0.724 | 0.604 | 0.617 | 0.586 | 0.602 | 0.487 | 0.600 |
| DIP verification | B0 + B1 | | 0.713 | 0.713 | 0.544 | 0.507 | 0.519 | 0.438 | 0.142 | 0.463 |
| Constructing a cancer gene network | NPP | | 12 | 14 | 10 | 10 | 11 | 12 | 8 | 10 |
| Predicting gene function | L0 + L1 | | 32.45 | 32.45 | 4.83 | 8.89 | 6.42 | 7.18 | 0.16 | 1.89 |
B0 + B1 indicates the average of the PPVs of B0- and B1-matched genes; NPP indicates the number of predicted metabolic pathways; L0 + L1 indicates the average number of L0- and L1-matched genes.