Fadi Towfic, Susan VanderPlas, Casey A Oliver, Oliver Couture, Christopher K Tuggle, M Heather West Greenlee, Vasant Honavar.
Abstract
BACKGROUND: Ortholog detection methods present a powerful approach for finding genes that participate in similar biological processes across different organisms, extending our understanding of interactions between genes across different pathways, and understanding the evolution of gene families.
Year: 2010 PMID: 20438654 PMCID: PMC2863066 DOI: 10.1186/1471-2105-11-S3-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Graph-based ortholog representation. A schematic of the graph representation of the BLAST orthologs based on protein-protein interaction networks and gene coexpression networks. The networks are represented as two labeled graphs (G1 and G2) with corresponding relationships among their nodes (similarly colored nodes are sequence homologous according to a BLAST search). Nodes from G1 (e.g., v3) are compared to their sequence-homologous counterparts in G2 (e.g., v'2 and v'6) based on the topology of their neighborhoods and the sequence homology of their neighbors. In the figure, v'2 has the same number of neighbors as v3, and one of the neighbors of v'2 (i.e., v'3) is sequence homologous to v4. Thus, v'2 is scored higher (more likely to be an ortholog of v3) than v'6. Protein-protein interaction networks are represented as unweighted graphs, while gene coexpression networks incorporate weights (calculated from correlations) into their edges.
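The comparison described in this caption can be illustrated with a small sketch. It assumes each network is stored as an adjacency dictionary (edge weights of 1.0 for an unweighted protein-protein interaction network, correlation values for a coexpression network) and that BLAST homology is given as a set of vertex pairs. The function name `neighborhood_features`, the two feature names, and the toy edges are hypothetical; `u2`, `u3`, and `u6` stand in for v'2, v'3, and v'6, and the figure's actual edges are not reproduced here.

```python
def neighborhood_features(g1, v, g2, candidate, homologous):
    """Compare the 1-hop neighborhoods of `v` in g1 and `candidate` in g2.

    Returns the difference in neighbor counts and the number of neighbor
    pairs that are sequence homologous. Illustrative features only.
    """
    n1 = set(g1.get(v, {}))
    n2 = set(g2.get(candidate, {}))
    pairs = sum(1 for a in n1 for b in n2 if (a, b) in homologous)
    return {
        "degree_gap": abs(len(n1) - len(n2)),
        "homologous_neighbor_pairs": pairs,
    }


# Toy graphs (hypothetical edges). Adjacency dict: vertex -> {neighbor: weight}.
g1 = {"v3": {"v2": 1.0, "v4": 1.0}, "v2": {"v3": 1.0}, "v4": {"v3": 1.0}}
g2 = {
    "u2": {"u1": 1.0, "u3": 1.0},
    "u1": {"u2": 1.0},
    "u3": {"u2": 1.0},
    "u6": {"u5": 1.0},
    "u5": {"u6": 1.0},
}
# Pairs reported as sequence homologous by a BLAST search (assumed input).
homologous = {("v3", "u2"), ("v3", "u6"), ("v4", "u3")}

features_u2 = neighborhood_features(g1, "v3", g2, "u2", homologous)
features_u6 = neighborhood_features(g1, "v3", g2, "u6", homologous)
```

With these toy inputs, `u2` matches `v3` in neighbor count and shares one homologous neighbor pair, while `u6` does neither, which mirrors the ranking described in the caption.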
Figure 2. Shortest path distance graph kernel. An example of the graph matching conducted by the shortest path graph kernel. Similarly colored nodes are sequence homologous according to a BLAST search. As can be seen from the figure, the graph kernel compares the lengths of the shortest paths around homologous vertices across the two graphs (taking into account the weights of the edges, if available). The red edges show the matching shortest path in both graphs as computed by the graph kernel. The shortest path distance graph kernel takes into account the sequence homology score for the matching vertices across the two graphs as well as the distances between the two matched vertices within the graphs.
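As a rough illustration of the idea in this caption, the sketch below scores a pair of start vertices by comparing shortest path lengths to homologous vertices in the two graphs. The caption does not give the kernel's formula, so the scoring rule used here (homology score multiplied by `exp(-|d1 - d2|)`), the function names, the treatment of edge weights as path lengths, and the toy graphs are all assumptions made for illustration.

```python
import heapq
import math


def shortest_path_lengths(graph, source):
    """Dijkstra distances from `source`; `graph[u][v]` is the edge length."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def shortest_path_score(g1, start1, g2, start2, homology):
    """Sum homology scores, discounted by mismatch in shortest path length.

    `homology` maps (vertex in g1, vertex in g2) to a homology score.
    The exponential discount is an assumed, illustrative choice.
    """
    d1 = shortest_path_lengths(g1, start1)
    d2 = shortest_path_lengths(g2, start2)
    score = 0.0
    for (u, v), h in homology.items():
        if u == start1 or v == start2:
            continue  # only vertices around the start pair contribute
        if u in d1 and v in d2:
            score += h * math.exp(-abs(d1[u] - d2[v]))
    return score


# Toy unweighted graphs (every edge has length 1.0); hypothetical data.
g1 = {"v3": {"v4": 1.0, "v5": 1.0}, "v4": {"v3": 1.0}, "v5": {"v3": 1.0}}
g2 = {
    "u2": {"u3": 1.0, "u4": 1.0},
    "u3": {"u2": 1.0},
    "u4": {"u2": 1.0, "u7": 1.0},
    "u7": {"u4": 1.0, "u6": 1.0},
    "u6": {"u7": 1.0},
}
homology = {("v4", "u3"): 1.0}

close_score = shortest_path_score(g1, "v3", g2, "u2", homology)
far_score = shortest_path_score(g1, "v3", g2, "u6", homology)
```

In this toy case the homologous vertex sits one step from both `v3` and `u2`, so that pairing scores higher than `v3` with `u6`, where the same vertex is four steps away.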
Figure 3. Random walk graph kernel. An example of the graph matching conducted by the random walk graph kernel. Similarly colored vertices are sequence homologous according to a BLAST search. As can be seen from the figure, the graph kernel compares the neighborhood around the starting vertices in each graph using random walks (taking into account the weights of the edges, if available). Colored edges indicate matching random walks across the two graphs of up to length 2. The random walk graph kernel takes into account the sequence homology of the vertices visited in the random walks across the two graphs as well as the general topology of the neighborhood around the starting vertex.
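The matching of walks described in this caption can be sketched as a count of walk pairs that start at the two vertices being compared and visit homologous vertices at every step. This is a minimal sketch under stated assumptions, not the authors' kernel: the function name `matching_walks`, the use of the product of edge weights as the weight of a walk pair, and the toy graphs are hypothetical.

```python
def matching_walks(g1, v1, g2, v2, homologous, max_len):
    """Weighted count of matching walk pairs of length 1..max_len.

    A pair of walks matches when the vertices reached at each step are
    sequence homologous. Each matching step is weighted by the product
    of the two edge weights (an assumed choice).
    """
    if max_len == 0:
        return 0.0
    total = 0.0
    for n1, w1 in g1.get(v1, {}).items():
        for n2, w2 in g2.get(v2, {}).items():
            if (n1, n2) in homologous:
                step = w1 * w2
                longer = matching_walks(g1, n1, g2, n2, homologous, max_len - 1)
                total += step * (1.0 + longer)
    return total


# Toy unweighted graphs (weight 1.0 on every edge); hypothetical data.
g1 = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
g2 = {
    "x": {"y": 1.0},
    "y": {"x": 1.0, "z": 1.0},
    "z": {"y": 1.0},
    "q": {"r": 1.0},
    "r": {"q": 1.0},
}
homologous = {("a", "x"), ("b", "y"), ("c", "z"), ("a", "q")}

score_x = matching_walks(g1, "a", g2, "x", homologous, max_len=2)
score_q = matching_walks(g1, "a", g2, "q", homologous, max_len=2)
```

Here `a` and `x` share one matching walk of length 1 and two of length 2, while `a` and `q` are sequence homologous but have no matching neighborhood, so only the first pairing receives a nonzero score.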
BLAST performance for ortholog detection
| Datasets | AUC |
|---|---|
| Mouse-Human (PPI) | 90.39 |
| Mouse-Fly (PPI) | 92.62 |
| Mouse-Yeast (PPI) | 96.14 |
| Human-Fly (PPI) | 88.89 |
| Human-Yeast (PPI) | 85.63 |
| Yeast-Fly (PPI) | 75.03 |
| Mouse-Human (gene-coexpression) | 90.40 |
Performance of the Reciprocal BLAST hit method on the fly, yeast, human and mouse protein-protein interaction datasets from DIP as well as the gene coexpression networks for mouse and human from GEO.
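For readers unfamiliar with the reciprocal BLAST hit method named in this caption, a minimal sketch follows. It assumes BLAST scores have already been parsed into dictionaries keyed by (query, subject) pairs; the function names and the toy identifiers and scores are hypothetical and are not taken from the datasets above.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a BLAST score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy scores for two species A and B (hypothetical values).
a_vs_b = {
    ("a1", "b1"): 500,
    ("a1", "b2"): 120,
    ("a2", "b2"): 300,
    ("a3", "b2"): 310,
}
b_vs_a = {("b1", "a1"): 495, ("b2", "a3"): 305, ("b2", "a2"): 290}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In the toy data `a2` hits `b2` best, but `b2` prefers `a3`, so only `(a1, b1)` and `(a3, b2)` are kept as reciprocal hits.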
Classifier performance using BLAST score as the sole feature for ortholog detection
| Datasets | Adaboost j48 AUC | NB AUC | SVM AUC | Log. Reg. AUC | Ensemble AUC |
|---|---|---|---|---|---|
| Mouse-Human (PPI) | 87.79 (4) | 90.15 (3) | 77.31 (5) | 90.29 (2) | 90.30 (1) |
| Mouse-Human (gene-coexpression) | 89.80 (4) | 70.4 (5) | 90.40 (1) | 90.40 (1) | 90.40 (1) |
| Mouse-Fly (PPI) | 87.58 (4) | 88.47 (3) | 70.17 (5) | 92.01 (1) | 88.89 (2) |
| Mouse-Yeast (PPI) | 89.85 (5) | 91.89 (2) | 90.78 (3) | 95.46 (1) | 91.45 (4) |
| Human-Fly (PPI) | 81.35 (4) | 87.70 (2) | 65.90 (5) | 88.90 (1) | 84.42 (3) |
| Human-Yeast (PPI) | 82.97 (3) | 81.26 (4) | 63.68 (5) | 85.50 (1) | 84.19 (2) |
| Yeast-Fly (PPI) | 73.02 (3) | 72.49 (4) | 56.80 (5) | 74.86 (1) | 74.48 (2) |
| Average rank (PPI datasets) | 3.83 | 3 | 4.67 | 1.17 | 2.33 |
| Average rank (all datasets) | 3.86 | 3.28 | 4.28 | 1.28 | 2.28 |
Performance of the Reciprocal BLAST hit score as the sole feature for the decision tree (j48), Naive Bayes (NB), Support Vector Machine (SVM), logistic regression (Log. Reg.) and Ensemble classifiers on the fly, yeast, human and mouse protein-protein interaction datasets from DIP as well as the gene coexpression networks for mouse and human from GEO. Values in parentheses are the ranks of the classifiers on the specified dataset.
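The two summary rows at the bottom of this table appear to be per-classifier average ranks, first over the six PPI datasets and then over all seven datasets. The sketch below recomputes them from the ranks printed in parentheses. It assumes that tied classifiers share the mean of the positions they occupy (so the three classifiers tied at rank 1 on the coexpression dataset each count as rank 2); this tie rule is inferred from the numbers rather than stated in the source, and the function names are hypothetical.

```python
from collections import Counter

# Ranks as printed in the table above, in column order:
# Adaboost j48, NB, SVM, Log. Reg., Ensemble.
PRINTED_RANKS = {
    "Mouse-Human (PPI)": [4, 3, 5, 2, 1],
    "Mouse-Human (gene-coexpression)": [4, 5, 1, 1, 1],
    "Mouse-Fly (PPI)": [4, 3, 5, 1, 2],
    "Mouse-Yeast (PPI)": [5, 2, 3, 1, 4],
    "Human-Fly (PPI)": [4, 2, 5, 1, 3],
    "Human-Yeast (PPI)": [3, 4, 5, 1, 2],
    "Yeast-Fly (PPI)": [3, 4, 5, 1, 2],
}


def resolve_ties(row):
    """Replace tied ranks by the mean of the positions they occupy.

    Assumption: k classifiers tied at rank r occupy positions r..r+k-1.
    """
    counts = Counter(row)
    return [r + (counts[r] - 1) / 2 for r in row]


def average_ranks(rows):
    """Column-wise mean of the tie-resolved ranks."""
    resolved = [resolve_ties(row) for row in rows]
    return [sum(col) / len(col) for col in zip(*resolved)]


ppi_only = average_ranks(
    [row for name, row in PRINTED_RANKS.items() if "(PPI)" in name]
)
all_datasets = average_ranks(list(PRINTED_RANKS.values()))
```

Under this tie rule the recomputed averages agree with both summary rows to within 0.01, the small differences being consistent with rounding or truncation in the printed values.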
Classifier performance using all features for ortholog detection
| Datasets | Adaboost j48 AUC | NB AUC | SVM AUC | Log. Reg. AUC | Ensemble AUC |
|---|---|---|---|---|---|
| Mouse-Human (PPI) | 95.19 (2) | 88.72 (5) | 90.78 (3) | 89.57 (4) | 96.18 (1) |
| Mouse-Human (gene-coexpression) | 89.80 (5) | 94.1 (4) | 97.50 (1) | 97.30 (2) | 96.10 (3) |
| Mouse-Fly (PPI) | 90.31 (1) | 85.81 (3) | 81.28 (4) | 80.67 (5) | 88.94 (2) |
| Mouse-Yeast (PPI) | 92.04 (3) | 85.50 (4) | 79.63 (5) | 95.60 (1) | 95.50 (2) |
| Human-Fly (PPI) | 88.18 (1) | 83.10 (4) | 75.03 (5) | 87.04 (3) | 87.20 (2) |
| Human-Yeast (PPI) | 82.83 (2) | 81.26 (4) | 78.22 (5) | 81.57 (3) | 84.84 (1) |
| Yeast-Fly (PPI) | 74.52 (1) | 69.36 (4) | 64.57 (5) | 74.33 (2) | 72.78 (3) |
| Average rank (PPI datasets) | 1.67 | 4 | 4.5 | 3 | 1.83 |
| Average rank (all datasets) | 2.14 | 4 | 4 | 2.86 | 2 |
Performance of all the combined features (Reciprocal BLAST hit score, 1- and 2-hop shortest path graph kernel scores, 1- and 2-hop random walk graph kernel scores, BaryCenter, betweenness, degree distribution and HITS) as input to the decision tree (j48), Naive Bayes (NB), Support Vector Machine (SVM), logistic regression (Log. Reg.) and Ensemble classifiers on the fly, yeast, human and mouse protein-protein interaction datasets from DIP as well as the gene coexpression networks for mouse and human from GEO. Values in parentheses are the ranks of the classifiers on the specified dataset.
Figure 4. Example of an ortholog pair detected by the ensemble classifier trained on network features. A sample 1-hop neighborhood around one of the matched orthologs (TNF receptor-associated factor 2, "P39429" in mouse and "Q12933" in human) according to the graph features (left: 1-hop neighborhood around the "P39429" protein for mouse; right: 1-hop neighborhood around the "Q12933" protein for human). Similarly colored nodes are sequence homologous. The graph properties search for similar topology and sequence homology around the neighborhood of the nodes being compared.
KEGG ortholog sample tables
| Mouse Protein | Human Protein | BLASTp score | RW 1HOP | SP 1HOP | RW 2HOP | SP 2HOP | BaryCenter | betweenness | Degree | HITS |
|---|---|---|---|---|---|---|---|---|---|---|
| P05627 | P05412 | 481 | 104 | 197.35 | 612 | 290.27 | 0.71 | 0.69 | 0.01 | 0.26 |
| P36898 | P36894 | 725 | 28.13 | 222.85 | 90.66 | 576.51 | 0.35 | 0.77 | 0.01 | 3.06E-10 |
| P39429 | Q12933 | 870 | 48 | 126.18 | 150.47 | 187.45 | 0.79 | 0.11 | 0.01 | 1.20E-4 |
KEGG orthologs detected using the Ensemble classifier utilizing all network features. The orthologs shown in the above table were missed by the BLAST logistic regression classifier.