Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic.
Abstract
BACKGROUND: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds k-nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expressions.
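The prediction step described in the abstract can be illustrated with a minimal sketch. It assumes that each data source supplies a similarity vector between the query and that source's training proteins, that a neighbor's vote for a term is weighted by its similarity, and that the per-source scores are combined with equal weights by default. The function names `knn_scores` and `ms_knn_scores`, the unnormalized vote, and the `weights` parameter are illustrative assumptions and are not taken from the paper.

```python
def knn_scores(sim, labels, k):
    """Score every term for one query protein using a single data source.

    sim    : similarity of the query to each training protein
    labels : labels[i][t] is 1 if training protein i has term t, else 0
    k      : number of nearest neighbors to use
    """
    # Indices of the k training proteins most similar to the query.
    top = sorted(range(len(sim)), key=lambda i: sim[i], reverse=True)[:k]
    n_terms = len(labels[0])
    # Similarity-weighted vote of the neighbors (assumed form, unnormalized).
    return [sum(sim[i] * labels[i][t] for i in top) for t in range(n_terms)]


def ms_knn_scores(sims_by_source, labels_by_source, k, weights=None):
    """Combine per-source kNN scores by a weighted average over sources."""
    per_source = [
        knn_scores(sim, labels, k)
        for sim, labels in zip(sims_by_source, labels_by_source)
    ]
    if weights is None:
        # Assumption: equal weight for every source.
        weights = [1.0 / len(per_source)] * len(per_source)
    n_terms = len(per_source[0])
    return [
        sum(w * scores[t] for w, scores in zip(weights, per_source))
        for t in range(n_terms)
    ]


# Toy example: two sources with different training sets and two terms.
sims = [[0.9, 0.1, 0.5], [0.0, 0.0]]
labels = [[[1, 0], [0, 1], [1, 1]], [[1, 0], [0, 1]]]
combined = ms_knn_scores(sims, labels, k=2)
```

In the toy example the second source has no similar neighbors, so it contributes zero to both terms and halves the combined score, which mirrors how a source without coverage for a protein dilutes the average.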
Year: 2013 PMID: 23514608 PMCID: PMC3584913 DOI: 10.1186/1471-2105-14-S3-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Visual summary of the datasets.
Summary of different data sources
| Data source | Training size | CAFA size |
|---|---|---|
| Protein sequence similarity | 36,924 | 48,298 |
| Microarray expression | 7,372 | 3,397 |
| Protein-protein interaction | 3,217 | 737 |
Comparison of average TermAUC of two different prediction algorithms
| Data sources | 122 MF terms | 546 BP terms | ||
|---|---|---|---|---|
| Sequence Similarity | 0.671 | 0.557 | ||
| Microarray | 0.555 | 0.563 | ||
| PPI | 0.574 | 0.580 | ||
Average TermAUC based on 5 training sets of different size.
| Training Set (Training Size) | 122 MF terms (TermAUC) | 546 BP terms (TermAUC) |
|---|---|---|
| (1) HUMAN (1,567) | 0.671 | 0.557 |
| (2) HUMAN (7,412) | 0.728 | 0.609 |
| (3) HUMAN + MOUSE + RAT (16,442) | 0.807 | 0.692 |
| (4) All Mammals (16,754) | 0.812 | 0.696 |
| (5) All Organisms (35,622) | | |
Average TermAUC based on 7 training sets with same size
| Training Set (Training Size) | 122 MF terms (TermAUC) | 546 BP terms (TermAUC) |
|---|---|---|
| (2) HUMAN (7,412) | 0.728 | 0.609 |
| (6) HUMAN + MOUSE + RAT (7,412) | 0.771 | 0.648 |
| (7) All Mammals (7,412) | 0.762 | 0.649 |
| (8) All Organisms (7,412) | 0.729 | 0.628 |
| (9) All Mammals excluding Human (7,412) | | |
| (10) All Organisms excluding Human (7,412) | 0.721 | 0.623 |
Comparison of AUCs of different methods
| Data Source | 122 MF terms (TermAUC) | 546 BP terms (TermAUC) |
|---|---|---|
| | 0.819 | 0.707 |
| | 0.574 | 0.580 |
| | 0.635 | 0.642 |
| MS- | | |
| MS-W- | 0.829 | 0.758 |
| MS-CW- | 0.831 | 0.715 |
| MS-CW- | 0.851 | 0.702 |
Prediction score and rank for test proteins annotated by GO:0044106
| Proteins | Microarray (AUC: 0.6127) | PPI (AUC: 0.5641) | Sequence (AUC: 0.8285) | Average (AUC: 0.9379) |
|---|---|---|---|---|
| SYK_HUMAN | 0.14 (1203) | 0 (NaN) | 2.17 (2) | 0.77 (3) |
| NOS3_HUMAN | 0.23 (212) | 0 (NaN) | 1.95 (3) | 0.73 (6) |
| NOS1_HUMAN | 0.29 (19) | 0 (NaN) | 1.92 (4) | 0.74 (5) |
| OAZ2_HUMAN | 0.17 (882) | 0 (NaN) | 1.80 (6) | 0.66 (8) |
| OAZ1_HUMAN | 0.18 (820) | 0 (NaN) | 1.63 (7) | 0.60 (9) |
| PEPD_HUMAN | 0.22 (340) | 0 (NaN) | 0 (NaN) | 0.07 (544) |
| PON1_HUMAN | 0.26 (66) | 1 (3) | 0 (NaN) | 0.42 (18) |
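The "Average" column in the table above appears to be the unweighted mean of the three per-source scores, with a source that returns no prediction counted as 0. This reading is inferred from the numbers shown, not stated in this excerpt. The short check below confirms that it reproduces every reported average to two decimals. The values in parentheses in the table are ranks and are not used here.

```python
# (microarray, PPI, sequence, reported average), copied from the table above.
rows = {
    "SYK_HUMAN": (0.14, 0.0, 2.17, 0.77),
    "NOS3_HUMAN": (0.23, 0.0, 1.95, 0.73),
    "NOS1_HUMAN": (0.29, 0.0, 1.92, 0.74),
    "OAZ2_HUMAN": (0.17, 0.0, 1.80, 0.66),
    "OAZ1_HUMAN": (0.18, 0.0, 1.63, 0.60),
    "PEPD_HUMAN": (0.22, 0.0, 0.0, 0.07),
    "PON1_HUMAN": (0.26, 1.0, 0.0, 0.42),
}


def mean_score(scores):
    """Unweighted mean of the per-source scores (assumed combination rule)."""
    return sum(scores) / len(scores)


# Largest gap between the recomputed mean and the reported average.
max_gap = max(
    abs(mean_score(values[:3]) - values[3]) for values in rows.values()
)
# Every gap stays within two-decimal rounding error.
consistent = max_gap <= 0.005
```

For example, SYK_HUMAN gives (0.14 + 0 + 2.17) / 3 = 0.77, and PON1_HUMAN, which has no sequence-based score, gives (0.26 + 1 + 0) / 3 = 0.42.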
AUC scores for MF terms
| Algorithm | Threshold | Top n | Weighted threshold | TermAUC |
|---|---|---|---|---|
| Prior | 0.867 | 0.742 | 0.795 | 0.500 |
| BLAST | 0.794 | 0.779 | 0.734 | 0.634 |
| Gotcha | 0.786 | 0.774 | 0.728 | 0.665 |
| | 0.814 | 0.780 | 0.747 | |
| MS | 0.701 |
AUC scores for BP terms
| Algorithm | Threshold | Top n | Weighted threshold | TermAUC |
|---|---|---|---|---|
| Prior | 0.630 | 0.500 | ||
| BLAST | 0.771 | 0.633 | 0.697 | 0.648 |
| Gotcha | 0.748 | 0.637 | 0.677 | |
| 0.811 | 0.724 | 0.651 | ||
| MS- | 0.893 | 0.636 | 0.818 | 0.650 |
Figure 2. AUC comparison on 11 MF functions.
Comparison of AUC scores on 8 test proteins based on MF terms
| Algorithm | Threshold | Top n | Weighted threshold |
|---|---|---|---|
| | 0.853 | 0.740 | 0.768 |
| MS- | 0.949 | 0.845 | 0.910 |
Comparison of AUC scores on 8 test proteins based on BP terms
| Algorithm | Threshold | Top n | Weighted threshold |
|---|---|---|---|
| | 0.798 | 0.526 | 0.696 |
| MS- | 0.920 | 0.526 | 0.846 |