Daniel Faria, António E N Ferreira, André O Falcão.
Abstract
BACKGROUND: Efficient and accurate prediction of protein function from sequence is one of the standing problems in Biology. The generalised use of sequence alignments for inferring function promotes the propagation of errors, and there are limits to its applicability. Several machine learning methods have been applied to predict protein function, but they lose much of the information encoded by protein sequences because they need to transform them to obtain data of fixed length.
Year: 2009 PMID: 19630945 PMCID: PMC2724424 DOI: 10.1186/1471-2105-10-231
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Characterization of the EC families selected
| EC family | P Train | N Train | P Test | N Test | Avg SICP | Std SICP | Avg SICN | Std SICN | NNP | NPN |
| 1.1.1.1 | 168 | 3611 | 42 | 893 | 85% | 15% | 47% | 9% | 2 | 9 |
| 1.1.1.25 | 173 | 3607 | 44 | 890 | 79% | 21% | 41% | 6% | 6 | 4 |
| 1.8.4.11 | 163 | 156 | 41 | 40 | 87% | 15% | 18% | 24% | 2 | 5 |
| 2.1.2.10 | 172 | 1300 | 44 | 325 | 86% | 16% | 26% | 17% | 2 | 0 |
| 2.3.2.6 | 160 | 160 | 41 | 41 | 82% | 18% | 31% | 21% | 2 | 1 |
| 2.5.1.55 | 161 | 2652 | 41 | 648 | 94% | 8% | 41% | 12% | 1 | 1 |
| 2.7.1.11 | 162 | 3381 | 41 | 844 | 86% | 20% | 46% | 5% | 4 | 1 |
| 2.7.1.21 | 165 | 3381 | 42 | 840 | 83% | 21% | 39% | 17% | 7 | 1 |
| 2.7.2.1 | 173 | 942 | 44 | 236 | 85% | 15% | 46% | 4% | 1 | 1 |
| 2.7.7.27 | 167 | 5606 | 42 | 1313 | 89% | 13% | 29% | 3% | 0 | 4 |
| 3.1.26.11 | 168 | 1109 | 42 | 278 | 85% | 15% | 24% | 17% | 1 | 3 |
| 3.5.4.19 | 175 | 1549 | 44 | 388 | 84% | 14% | 30% | 21% | 0 | 0 |
| 4.1.1.31 | 161 | 2603 | 41 | 649 | 87% | 17% | 18% | 18% | 1 | 1 |
| 4.2.3.4 | 172 | 488 | 43 | 123 | 79% | 20% | 23% | 21% | 3 | 0 |
| 5.1.1.1 | 175 | 387 | 44 | 97 | 79% | 23% | 20% | 17% | 3 | 2 |
| 5.1.1.3 | 163 | 400 | 41 | 99 | 88% | 17% | 30% | 17% | 2 | 2 |
| 5.3.1.24 | 173 | 1744 | 44 | 431 | 78% | 20% | 43% | 14% | 8 | 3 |
| 6.3.4.3 | 164 | 1071 | 42 | 267 | 84% | 15% | 31% | 14% | 0 | 0 |
P Train – size of the positive training set; N Train – size of the negative training set; P Test – size of the positive test set; N Test – size of the negative test set; Avg SICP – average sequence identity between each positive test protein and the closest positive training protein; Std SICP – standard deviation of the SICP; Avg SICN – average sequence identity between each positive test protein and the closest negative training protein; Std SICN – standard deviation of the SICN; NNP – near negative positives (i.e. positive test proteins whose closest negative is at least within 10% sequence identity of the closest positive); NPN – near positive negatives.
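The NNP definition above can be made concrete with a short sketch. It assumes that "within 10%" means 10 percentage points of sequence identity and that the per-protein closest-positive and closest-negative identities are already available as two parallel lists; the function name and the example values are hypothetical and are not data from the paper.

```python
def count_near_negative_positives(sicp, sicn, margin=10.0):
    """Count positive test proteins whose closest negative training protein
    is within `margin` percentage points of (or above) their closest positive.

    sicp[i]: identity (%) of test protein i to its closest positive training protein.
    sicn[i]: identity (%) of test protein i to its closest negative training protein.
    """
    if len(sicp) != len(sicn):
        raise ValueError("sicp and sicn must have the same length")
    return sum(1 for p, n in zip(sicp, sicn) if n >= p - margin)


# Hypothetical identities (%), for illustration only.
sicp = [92.0, 85.0, 40.0, 55.0]
sicn = [47.0, 50.0, 38.0, 44.0]
print(count_near_negative_positives(sicp, sicn))  # 1 (only the third protein qualifies)
```

Proteins counted this way are the ones an alignment-based classifier is most likely to confuse, since their nearest negative is almost as similar as their nearest positive.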
Training and test results for EC family classification using Peptide Programs
| EC family | Training Accuracy | Training Precision | Training Recall | Training MCC | Test Accuracy | Test Precision | Test Recall | Test MCC |
| 1.1.1.1 | 99% | 97% | 87% | 0.91 | 99% | 94% | 81% | 0.87 |
| 1.1.1.25 | 100% | 99% | 94% | 0.96 | 99% | 93% | 84% | 0.88 |
| 1.8.4.11 | 100% | 100% | 100% | 1.00 | 100% | 100% | 100% | 1.00 |
| 2.1.2.10 | 100% | 100% | 100% | 1.00 | 100% | 98% | 100% | 0.99 |
| 2.3.2.6 | 100% | 100% | 100% | 1.00 | 100% | 100% | 100% | 1.00 |
| 2.5.1.55 | 100% | 100% | 100% | 1.00 | 100% | 98% | 98% | 0.97 |
| 2.7.1.11 | 100% | 100% | 96% | 0.98 | 100% | 100% | 90% | 0.95 |
| 2.7.1.21 | 99% | 99% | 86% | 0.92 | 99% | 97% | 79% | 0.87 |
| 2.7.2.1 | 100% | 100% | 100% | 1.00 | 100% | 100% | 100% | 1.00 |
| 2.7.7.27 | 100% | 96% | 93% | 0.95 | 99% | 93% | 88% | 0.90 |
| 3.1.26.11 | 100% | 100% | 99% | 1.00 | 99% | 93% | 100% | 0.96 |
| 3.5.4.19 | 100% | 100% | 97% | 0.98 | 99% | 100% | 91% | 0.95 |
| 4.1.1.31 | 100% | 100% | 100% | 1.00 | 100% | 98% | 100% | 0.99 |
| 4.2.3.4 | 100% | 100% | 100% | 1.00 | 99% | 100% | 98% | 0.98 |
| 5.1.1.1 | 100% | 100% | 100% | 1.00 | 100% | 100% | 100% | 1.00 |
| 5.1.1.3 | 100% | 100% | 100% | 1.00 | 100% | 100% | 100% | 1.00 |
| 5.3.1.24 | 99% | 100% | 93% | 0.96 | 98% | 95% | 82% | 0.87 |
| 6.3.4.3 | 100% | 100% | 100% | 1.00 | 99% | 98% | 98% | 0.97 |
| Average | 100% | 100% | 97% | 0.98 | 99% | 98% | 94% | 0.95 |
All results were obtained with PP classifiers with 4 registers and 2 instructions per amino acid, except those of families 1.1.1.1 (which had 6 registers) and 1.8.4.11 (which had 2 registers and 1 instruction per amino acid).
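For readers who want to relate the Accuracy, Precision, Recall and MCC (Matthews correlation coefficient) columns to one another, the sketch below computes all four from a binary confusion matrix using their standard definitions. The source does not report confusion counts; the counts in the example are an assumption, chosen only because they are consistent with the rounded test values reported for family 1.1.1.1 (42 positive and 893 negative test proteins), and the function name is hypothetical.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, mcc) for a binary confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC is undefined when any marginal is zero; 0.0 is returned by convention.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, recall, mcc


# Assumed counts (not from the paper): 34 of 42 positives recovered,
# 2 of 893 negatives misclassified as positive.
acc, prec, rec, mcc = binary_metrics(tp=34, fp=2, tn=891, fn=8)
print(f"{acc:.0%} {prec:.0%} {rec:.0%} {mcc:.2f}")  # 99% 94% 81% 0.87
```

With negative sets this much larger than the positive sets, accuracy stays near 99% even when a fifth of the positives are missed, which is why MCC is the more informative column for comparing families.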
Test results for EC family classification using Support Vector Machines
| EC family | Accuracy | Precision | Recall | MCC |
| 1.1.1.1 | 98% | 80% | 86% | 0.82 |
| 1.1.1.25 | 99% | 93% | 91% | 0.92 |
| 1.8.4.11 | 100% | 100% | 100% | 1.00 |
| 2.1.2.10 | 100% | 100% | 100% | 1.00 |
| 2.3.2.6 | 100% | 100% | 100% | 1.00 |
| 2.5.1.55 | 100% | 100% | 100% | 1.00 |
| 2.7.1.11 | 100% | 98% | 98% | 0.97 |
| 2.7.1.21 | 99% | 93% | 88% | 0.90 |
| 2.7.2.1 | 99% | 96% | 100% | 0.97 |
| 2.7.7.27 | 100% | 100% | 100% | 1.00 |
| 3.1.26.11 | 99% | 98% | 98% | 0.97 |
| 3.5.4.19 | 99% | 96% | 98% | 0.96 |
| 4.1.1.31 | 100% | 95% | 100% | 0.96 |
| 4.2.3.4 | 99% | 100% | 95% | 0.97 |
| 5.1.1.1 | 100% | 100% | 100% | 1.00 |
| 5.1.1.3 | 97% | 95% | 95% | 0.93 |
| 5.3.1.24 | 98% | 86% | 95% | 0.89 |
| 6.3.4.3 | 100% | 100% | 100% | 1.00 |
| Average | 99% | 96% | 97% | 0.96 |
All results were obtained with SVMs using polynomial kernels, except those of families 2.5.1.55 and 2.7.7.27, which were obtained with radial basis kernels. The training results were perfect for all EC families.
Test results for EC family classification using BLAST
| EC family | Accuracy | Precision | Recall | MCC |
| 1.1.1.1 | 100% | 95% | 98% | 0.96 |
| 1.1.1.25 | 100% | 100% | 98% | 0.99 |
| 1.8.4.11 | 98% | 98% | 98% | 0.95 |
| 2.1.2.10 | 100% | 100% | 98% | 0.99 |
| 2.3.2.6 | 100% | 100% | 100% | 1.00 |
| 2.5.1.55 | 100% | 100% | 100% | 1.00 |
| 2.7.1.11 | 100% | 100% | 98% | 0.99 |
| 2.7.1.21 | 99% | 97% | 90% | 0.94 |
| 2.7.2.1 | 100% | 100% | 100% | 1.00 |
| 2.7.7.27 | 100% | 100% | 100% | 1.00 |
| 3.1.26.11 | 100% | 100% | 98% | 0.99 |
| 3.5.4.19 | 100% | 100% | 100% | 1.00 |
| 4.1.1.31 | 100% | 100% | 100% | 1.00 |
| 4.2.3.4 | 99% | 100% | 95% | 0.97 |
| 5.1.1.1 | 96% | 100% | 89% | 0.92 |
| 5.1.1.3 | 99% | 98% | 100% | 0.98 |
| 5.3.1.24 | 100% | 96% | 100% | 0.98 |
| 6.3.4.3 | 100% | 100% | 100% | 1.00 |
| Average | 99% | 99% | 98% | 0.98 |
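As a quick consistency check on the "Average" rows, the sketch below recomputes the mean test MCC for each method from the per-family values in the three result tables. The values are copied from the tables; the variable names are illustrative, and taking an unweighted mean over the 18 families is an assumption about how the averages were obtained.

```python
from statistics import mean

# Test MCC per EC family, in table order (1.1.1.1 ... 6.3.4.3).
test_mcc = {
    "Peptide Programs": [0.87, 0.88, 1.00, 0.99, 1.00, 0.97, 0.95, 0.87, 1.00,
                         0.90, 0.96, 0.95, 0.99, 0.98, 1.00, 1.00, 0.87, 0.97],
    "SVM": [0.82, 0.92, 1.00, 1.00, 1.00, 1.00, 0.97, 0.90, 0.97,
            1.00, 0.97, 0.96, 0.96, 0.97, 1.00, 0.93, 0.89, 1.00],
    "BLAST": [0.96, 0.99, 0.95, 0.99, 1.00, 1.00, 0.99, 0.94, 1.00,
              1.00, 0.99, 1.00, 1.00, 0.97, 0.92, 0.98, 0.98, 1.00],
}

# Unweighted mean over families, rounded to two decimals like the tables.
averages = {method: round(mean(values), 2) for method, values in test_mcc.items()}
for method, avg in averages.items():
    print(f"{method}: {avg:.2f}")
```

Under that assumption the recomputed means agree with the reported averages of 0.95, 0.96 and 0.98 for Peptide Programs, SVMs and BLAST respectively.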