Abstract
BACKGROUND: In recent years, high-throughput protein interaction identification methods have generated a large amount of data. When combined with the results from other in vivo and in vitro experiments, a complex set of relationships between biological molecules emerges. The growing popularity of network analysis and data mining has allowed researchers to recognize indirect connections between these molecules. Due to the interdependent nature of network entities, evaluating proteins in this context can reveal relationships that may not otherwise be evident.
Year: 2015 PMID: 26043920 PMCID: PMC4460923 DOI: 10.1186/1755-8794-8-S2-S9
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Figure 1. The root node indicates the bias in the data set, i.e., the ratio of positive to negative class examples (disease-associated proteins versus non-disease-associated proteins). The rectangles (decision nodes) contain the feature name. The number in parentheses within each decision node indicates the order in which the rule was found. The amount of node conservation between each of the trees generated in the validation step is indicated by the color of the box (red: ≥ 90%, orange: ≥ 70% (none in this tree), yellow: ≥ 50%, green: ≥ 30%, blue: ≥ 10%, black: < 10% (none in this tree)). Ovals (prediction nodes) contain the value for the weighted vote, where a positive number indicates a prediction for disease-association. The numbers next to the arrows correspond to the threshold for the prediction. If the attribute value is equal to or exceeds this number, the left path is followed; otherwise the prediction follows the right path.
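The way such a tree produces a score can be illustrated with a minimal Python sketch. It assumes the usual alternating decision tree convention that the final score is the sum of every prediction-node value reachable for an example, with the sign of the sum giving the class. The class names, feature names, thresholds, and vote values below are all hypothetical and are not taken from the tree in the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Prediction:
    """Prediction node (oval): a weighted vote plus any decision nodes below it."""
    value: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """Decision node (rectangle): one feature compared against one threshold."""
    feature: str
    threshold: float
    at_or_above: Prediction  # followed when the feature value >= threshold
    below: Prediction        # followed otherwise


def adtree_score(node: Prediction, example: Dict[str, float]) -> float:
    """Sum the votes of every prediction node reachable for this example."""
    total = node.value
    for s in node.splitters:
        branch = s.at_or_above if example[s.feature] >= s.threshold else s.below
        total += adtree_score(branch, example)
    return total


# Hypothetical toy tree: every name and number here is invented for illustration.
toy_tree = Prediction(
    value=-0.5,  # root vote, reflecting the class bias of the data set
    splitters=[
        Splitter("degree", 10.0, Prediction(0.75), Prediction(-0.25)),
        Splitter(
            "neighbor_disease_fraction",
            0.5,
            Prediction(
                1.0,
                [Splitter("betweenness", 0.125, Prediction(0.5), Prediction(-0.5))],
            ),
            Prediction(-0.75),
        ),
    ],
)

example = {"degree": 12.0, "neighbor_disease_fraction": 0.75, "betweenness": 0.0625}
score = adtree_score(toy_tree, example)
predicted_disease_related = score > 0  # positive sum -> disease-associated
```

Because every decision node under a reached prediction node is evaluated, several paths contribute to one score. This is why the magnitude of the sum can be read as a confidence value.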
Figure 2. ROC curves comparing five classifiers run over the disease-protein network data set. The top two performers were ADTree and AdaBoost (both AUC = 0.795), followed by the Bayesian network and the Naïve Bayesian classifiers (both AUC = 0.754), and finally the RBF network (AUC = 0.726). The curves are colored according to the threshold value and based on a color gradient scale from blue (threshold value of 0) to orange (threshold value of 1). This figure was created using Weka [40].
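For readers unfamiliar with how an AUC value is derived from a ROC curve, the following Python sketch sweeps a decision threshold over classifier scores and integrates the resulting curve with the trapezoidal rule. It is a generic illustration, not the Weka procedure used for the figure; the function names, labels, and scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so a tie yields a single diagonal segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = disease-associated) and classifier scores.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives, which gives a scale for reading the values reported in the caption.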
Table 1. A subset of negative-class proteins predicted to be disease-related
| Conf. Score | OS | DORIF | OMIM | Suspected Disease Relationship |
|---|---|---|---|---|
| 6.24096 | CDH5 | - | - | Melanoma, tumor metastasis |
| 5.8186 | PTCH1 | - | 109400, 605462, 610828 | Basal Cell Nevus Syndrome, Basal Cell Carcinoma |
| 5.19721 | STAMBPL1 | - | - | Very light evidence, Alzheimer's |
| 5.14813 | MDH2 | - | - | Very light evidence, tumor development |
| 1.09972 | DPP4 | - | - | Diabetes (17 PMIDs), colon cancer (3 PMIDs) |
| 1.09972 | GRK5 | - | - | Very light evidence, heart failure |
| 0.907016 | GZMB | - | - | Lymphoma (30 PMIDs), tumors (92 PMIDs) |
| 0.898631 | TCF4 | - | 610954 | Pitt-Hopkins Syndrome, various cancer (light evidence) |
| 0.705929 | FGR | - | - | Breast cancer (3 PMIDs), prostate cancer (1 PMID) |
| 0.705929 | FLT1 | - | - | Cancer, various |
| 0.705929 | PECAM1 | - | - | Cancer, various |
| 0.705929 | SREBF2 | - | - | Prostate cancer (2 PMIDs) |
| 0.705929 | STAT6 | - | - | Prostate cancer (3 PMIDs) |
| 0.705929 | TOP1 | - | - | Leukemia, colon and ovarian cancer |
| 0.664823 | CD74 | - | - | Very light evidence, lymphoma |
A subset of 15 proteins belonging to the non-disease-related class (lacking DORIF annotation) but predicted to be disease-related, sorted by the ADTree-assigned confidence score. Two proteins (PTCH1 and TCF4) have associated OMIM disorders. Nine of the 15 proteins (CDH5, DPP4, GZMB, FGR, FLT1, PECAM1, SREBF2, STAT6, and TOP1) have moderate to strong evidence of disease association, while four (STAMBPL1, MDH2, GRK5, and CD74) have light evidence linking them to disease. '(n PMIDs)' indicates the number of PubMed IDs connecting a protein to a particular disease. 'Conf. Score' is the confidence score assigned by the ADTree classifier; 'OS' is the official symbol of the gene; 'DORIF' is Disease Ontology + Gene Reference Into Function; 'OMIM' is the MIM number associated with the gene; 'light evidence' is defined as having a predicted disease association according to the MalaCards database [48]. Disease information for this table was acquired from the GeneCards database [49].
Figure 3. The five most common diseases associated with the first-order neighbors of DPP4 (those proteins with a direct interaction). DPP4 has 55 PubMed IDs that associate it with non-insulin-dependent diabetes mellitus (NIDDM). Interestingly, NIDDM is often accompanied by beta cell autoimmunity, where the beta cells of the pancreas are destroyed by an autoimmune disorder [50].
Figure 4. The five most common diseases associated with the second-order neighbors (i.e., neighbors of neighbors) of FGR. There are three PMIDs associating this gene with breast cancer and one PMID linking it to prostate cancer.
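The neighborhood summaries in these captions can be sketched in Python as follows. The sketch assumes an undirected interaction network stored as an adjacency dictionary and a separate protein-to-disease annotation map. It also assumes that second-order neighbors exclude the query protein and its direct neighbors, which the captions do not specify. All identifiers and data are hypothetical placeholders, not values from the PPI-disease data set.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Adjacency = Dict[str, Set[str]]


def first_order(adj: Adjacency, protein: str) -> Set[str]:
    """Proteins with a direct interaction with `protein`."""
    return set(adj.get(protein, ())) - {protein}


def second_order(adj: Adjacency, protein: str) -> Set[str]:
    """Neighbors of neighbors, excluding the protein and its direct neighbors
    (an assumption; the exclusion rule is not stated in the source)."""
    direct = first_order(adj, protein)
    reached: Set[str] = set()
    for neighbor in direct:
        reached |= set(adj.get(neighbor, ()))
    return reached - direct - {protein}


def top_diseases(
    proteins: Iterable[str], annotations: Dict[str, List[str]], k: int = 5
) -> List[Tuple[str, int]]:
    """Count how many of the given proteins carry each disease; return the top k."""
    counts: Counter = Counter()
    for p in proteins:
        counts.update(set(annotations.get(p, ())))  # one vote per protein per disease
    return counts.most_common(k)


# Hypothetical toy network and annotations.
toy_adj: Adjacency = {
    "Q": {"A", "B"},
    "A": {"Q", "C"},
    "B": {"Q", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
toy_annotations = {"A": ["d1"], "B": ["d1", "d2"], "C": ["d3"], "D": ["d3", "d2"]}

direct_summary = top_diseases(first_order(toy_adj, "Q"), toy_annotations)
indirect_summary = top_diseases(second_order(toy_adj, "Q"), toy_annotations)
```

Counting each disease once per neighbor keeps a heavily annotated protein from dominating the summary; counting annotations or PubMed IDs instead would be an equally plausible design.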
Figure 5. The network neighborhood of the transcription factor TCF4. A subset of proteins from the PPI-disease data set highlighting the relationship between breast cancer-related genes and the transcription factor TCF4, one of 15 proteins in our set of high-confidence false-positive predictions (Table 1). Proteins are colored according to modularity class (four modules were identified). Proteins are labeled with the gene's official symbol followed by '- 1' to indicate breast cancer association or '- 0' to indicate no association. The TCF4 node has been made larger for identification purposes.