Xing-Zhong Zhang, Yan-Li Pang, Xian Wang, Yan-Hui Li.
Abstract
Human polycystic ovary syndrome (PCOS) is a highly heritable disease regulated by genetic and environmental factors. Identifying PCOS genes in the wet lab is time-consuming and costly, so an algorithm that predicts PCOS candidate genes would be helpful. In this study, for the first time, we systematically analyzed the properties of human PCOS genes. Compared with genes not yet known to be involved in PCOS regulation, known PCOS genes display distinguishing characteristics: (i) they tend to be located at the network center; (ii) they tend to interact with each other; (iii) they tend to be enriched in certain biological processes. Based on these features, we developed a machine-learning algorithm to predict new PCOS genes. In total, 233 PCOS candidates were predicted with a posterior probability >0.9. Evidence supporting 7 of the top 10 predictions has been found.
Year: 2018 PMID: 30154492 PMCID: PMC6113217 DOI: 10.1038/s41598-018-31110-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Network Characteristics of PCOS Genes.
| Dataset | Class | Size | Degree | K-core | Betweenness | 1st PCOS Ratio | 2nd PCOS Ratio |
|---|---|---|---|---|---|---|---|
| Total | PCOS | 306 | 41.81 | 16.78 | 34463 | 0.11 | 0.04 |
| Total | Non-PCOS | 16676 | 22.48 | 11.66 | 17278 | 0.04 | 0.03 |
| Total | P value | | 4.2E-13 | 1.7E-10 | 1.4E-17 | 3.0E-48 | 6.0E-20 |
| PCOSDB | PCOS | 185 | 50.54 | 18.78 | 40177 | 0.10 | 0.03 |
| PCOSDB | Non-PCOS | 16676 | 22.48 | 11.66 | 17278 | 0.02 | 0.02 |
| PCOSDB | P value | | 1.5E-12 | 5.2E-10 | 2.0E-12 | 2.0E-36 | 1.3E-09 |
| PCOSKB | PCOS | 226 | 38.60 | 15.43 | 33614 | 0.09 | 0.04 |
| PCOSKB | Non-PCOS | 16676 | 22.48 | 11.66 | 17278 | 0.03 | 0.02 |
| PCOSKB | P value | | 2.6E-07 | 9.1E-07 | 5.3E-12 | 1.5E-32 | 1.1E-20 |
“Total” indicates all PCOS genes covered by either PCOSDB or PCOSKB. “Non-PCOS” indicates the remaining genes. The degree of a gene is the number of genes it interacts with directly. The K-core of a network is obtained by recursively deleting genes with a degree lower than K until every remaining gene in the subnetwork has a degree of at least K. Betweenness counts the number of times a gene lies on the shortest path between two other genes. The 1st PCOS ratio of a gene is the number of PCOS genes among its direct interaction partners divided by its degree. The 2nd PCOS ratio of a gene is the number of PCOS genes among its 2nd (two-step) interaction partners divided by the total number of its 2nd interaction partners. P values were calculated with the Kolmogorov-Smirnov (KS) test. PCOS represents polycystic ovary syndrome.
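The definitions in this footnote can be illustrated with a minimal pure-Python sketch on a toy network. The edge list, the gene labels `A` to `E`, the PCOS set, and all function names are hypothetical; "2nd interaction partners" is assumed here to mean genes reachable in exactly two steps, excluding the gene itself and its direct partners. Betweenness is omitted for brevity.

```python
def build_graph(edges):
    """Build an undirected adjacency map {gene: set(partners)} from an edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def second_neighbors(adj, gene):
    """Genes two steps away (assumption: excludes the gene and its direct partners)."""
    first = adj[gene]
    second = set()
    for partner in first:
        second |= adj[partner]
    return second - first - {gene}


def pcos_ratios(adj, gene, pcos_genes):
    """Return (1st PCOS ratio, 2nd PCOS ratio) for one gene."""
    first = adj[gene]
    second = second_neighbors(adj, gene)
    ratio1 = len(first & pcos_genes) / len(first) if first else 0.0
    ratio2 = len(second & pcos_genes) / len(second) if second else 0.0
    return ratio1, ratio2


def core_numbers(adj):
    """Largest K such that each gene survives in the K-core (simple peeling)."""
    degree = {gene: len(partners) for gene, partners in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Always peel the gene with the lowest current degree.
        gene = min(remaining, key=lambda g: degree[g])
        k = max(k, degree[gene])
        core[gene] = k
        remaining.remove(gene)
        for partner in adj[gene]:
            if partner in remaining:
                degree[partner] -= 1
    return core


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
pcos_genes = {"B", "E"}  # hypothetical "known PCOS" labels
adj = build_graph(edges)

degree_c = len(adj["C"])
ratio1_c, ratio2_c = pcos_ratios(adj, "C", pcos_genes)
cores = core_numbers(adj)
print(degree_c, ratio1_c, ratio2_c, cores)
```

In this toy network, gene `C` has three direct partners, one of which is labeled PCOS, and a single two-step partner (`E`), which is also labeled PCOS.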
Figure 1. Cumulative frequency distributions of network features of PCOS genes and non-PCOS genes. PCOS genes tend to have a higher degree (A), K-core (B), betweenness (C), 1st PCOS ratio (D), and 2nd PCOS ratio (E) than non-PCOS genes. For each feature, the cumulative frequency reaches 100% for both PCOS genes and non-PCOS genes. PCOS represents polycystic ovary syndrome.
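As a rough illustration of how two cumulative frequency distributions are compared, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, that is, the largest vertical gap between two empirical cumulative distributions. The function name and the sample values are hypothetical and are not taken from the study; the sketch returns only the statistic, not a P value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)  # share of a <= x
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)  # share of b <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical degree values for two small groups of genes.
group_high = [30, 42, 55, 61, 48]
group_low = [5, 12, 20, 33, 18]
print(ks_statistic(group_high, group_low))
```

Where SciPy is available, `scipy.stats.ks_2samp` returns both the statistic and a P value for the same comparison.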
The Classification Performance of Different Classifiers.
| Classifier | Precision | Recall | F1 | AUC |
|---|---|---|---|---|
| KNN ( | 0.77 | 0.69 | 0.73 | 0.78 |
| Decision tree | 0.76 | 0.74 | 0.75 | 0.79 |
| SVM (linear) | 0.71 | |||
| SVM (polynomial d = 3) | 0.49 | 0.73 | 0.58 | 0.57 |
| SVM (RBF) | 0.79 | 0.68 | 0.73 | 0.79 |
SVM (linear), SVM (polynomial d = 3), and SVM (RBF) indicate that the kernel function of the SVM is linear, polynomial (degree 3), and radial basis function, respectively.
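The F1 column can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below does this for the rows of the table that report all three values; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "KNN": (0.77, 0.69, 0.73),
    "Decision tree": (0.76, 0.74, 0.75),
    "SVM (polynomial d = 3)": (0.49, 0.73, 0.58),
    "SVM (RBF)": (0.79, 0.68, 0.73),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1_computed = f1_from(precision, recall)
    print(f"{name}: computed {f1_computed:.3f}, reported {f1_reported:.2f}")
```

Differences in the last reported digit are plausible because the tabulated precision and recall are themselves rounded to two decimals.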
Figure 2. The ROC curve of SVM (linear). SVM (linear) achieved the best classification performance using network and GO functional features.
Top 25 Predicted PCOS Genes.
| Symbol | Name | Posterior Probability |
|---|---|---|
| CTNNB1 | catenin beta 1 | 0.99932 |
| THBS1 | thrombospondin 1 | 0.99864 |
| IFNG | interferon gamma | 0.99794 |
| SMAD3 | SMAD family member 3 | 0.99736 |
| WNT5A | Wnt family member 5A | 0.99694 |
| EGFR | epidermal growth factor receptor | 0.9964 |
| HIF1A | hypoxia inducible factor 1 subunit alpha | 0.99623 |
| SRC | SRC proto-oncogene, non-receptor tyrosine kinase | 0.99614 |
| ENG | endoglin | 0.99536 |
| NOG | noggin | 0.99505 |
| SIRT1 | sirtuin 1 | 0.99498 |
| PTEN | phosphatase and tensin homolog | 0.99429 |
| SHH | sonic hedgehog | 0.9936 |
| CAV1 | caveolin 1 | 0.9934 |
| SMAD4 | SMAD family member 4 | 0.99129 |
| GREM1 | gremlin 1, DAN family BMP antagonist | 0.9893 |
| BMP10 | bone morphogenetic protein 10 | 0.9886 |
| GDF5 | growth differentiation factor 5 | 0.98854 |
| FGA | fibrinogen alpha chain | 0.98846 |
| GATA3 | GATA binding protein 3 | 0.98752 |
| TGFBR3 | transforming growth factor beta receptor 3 | 0.98745 |
| JAK2 | Janus kinase 2 | 0.98715 |
| LYN | LYN proto-oncogene, Src family tyrosine kinase | 0.98662 |
| NOTCH1 | notch 1 | 0.98624 |
| LGALS9 | galectin 9 | 0.98621 |
Posterior probability is given by SVM to evaluate the reliability of the prediction. SVM represents support vector machine.
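The source does not state how the SVM output is converted into a posterior probability. One common approach is Platt scaling, which passes the SVM decision value through a fitted sigmoid. The sketch below assumes that approach; the sigmoid parameters `a` and `b`, the gene labels, and the decision values are all hypothetical. It then keeps candidates whose probability exceeds 0.9, the cutoff mentioned in the abstract.

```python
import math


def platt_probability(decision_value, a=-2.0, b=0.0):
    """Sigmoid mapping of an SVM decision value to a probability.

    a and b are hypothetical; in practice they are fitted on held-out data.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def select_candidates(decision_values, threshold=0.9):
    """Rank genes by probability and keep those above the threshold."""
    ranked = sorted(
        ((gene, platt_probability(value)) for gene, value in decision_values.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(gene, prob) for gene, prob in ranked if prob > threshold]


# Hypothetical decision values for three placeholder genes.
decision_values = {"GENE_A": 3.0, "GENE_B": 0.5, "GENE_C": -1.0}
candidates = select_candidates(decision_values)
print(candidates)
```

With these placeholder values only `GENE_A` passes the 0.9 cutoff, since a decision value of 0.5 maps to a probability of roughly 0.73 under the assumed parameters.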
Formal Representation of Graph Measures.
| Name | Function | Descriptions |
|---|---|---|
| Degree | | the number of direct interaction partners of node i |
| Degree-2 | | the number of 2-step interaction partners of node i |
| 1st PCOS ratio | | |
| 2nd PCOS ratio | | |
| Betweenness | | |
| | | A |
Functions are the definitions of the topological features. Descriptions give explanations for symbols in the definitions.
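One way to write these measures, following the verbal definitions in the footnote to the network characteristics table, is sketched below. The notation is an assumption and not necessarily the authors': $A$ is the adjacency matrix of the interaction network ($A_{ij} = 1$ if genes $i$ and $j$ interact, 0 otherwise), $N_1(i)$ and $N_2(i)$ are the sets of direct and 2-step interaction partners of node $i$, $P$ is the set of known PCOS genes, $\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(i)$ is the number of those paths passing through $i$. The betweenness line uses the standard normalized form.

```latex
\begin{aligned}
\text{Degree:} \quad & k_i = \sum_{j} A_{ij} = |N_1(i)| \\
\text{Degree-2:} \quad & k_i^{(2)} = |N_2(i)| \\
\text{1st PCOS ratio:} \quad & r_i^{(1)} = \frac{|N_1(i) \cap P|}{k_i} \\
\text{2nd PCOS ratio:} \quad & r_i^{(2)} = \frac{|N_2(i) \cap P|}{k_i^{(2)}} \\
\text{Betweenness:} \quad & b_i = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}
\end{aligned}
```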