H H Lin, L Y Han, H L Zhang, C J Zheng, B Xie, Z W Cao, Y Z Chen.
Abstract
Metal-binding proteins play important roles in structural stability, signaling, regulation, transport, immune response, metabolism control, and metal homeostasis. Because of their functional and sequence diversity, it is desirable to explore additional methods for predicting metal-binding proteins irrespective of sequence similarity. This work explores support vector machines (SVM) as such a method. SVM prediction systems were developed using 53,333 metal-binding and 147,347 non-metal-binding proteins, and evaluated on an independent set of 31,448 metal-binding and 79,051 non-metal-binding proteins. The computed prediction accuracy is 86.3%, 81.6%, 83.5%, 94.0%, 81.2%, 85.4%, 77.6%, 90.4%, 90.9%, 74.9% and 78.1% for calcium-binding, cobalt-binding, copper-binding, iron-binding, magnesium-binding, manganese-binding, nickel-binding, potassium-binding, sodium-binding, zinc-binding, and all metal-binding proteins, respectively. The accuracy for the non-member proteins of each class is 88.2%, 99.9%, 98.1%, 91.4%, 87.9%, 94.5%, 99.2%, 99.9%, 99.9%, 98.0%, and 88.0%, respectively. Comparable accuracies were obtained by using a different SVM kernel function. Our method predicts as metal-binding 67% of the 87 metal-binding proteins that are non-homologous to any protein in the Swissprot database and 85.3% of the 333 proteins with known metal-binding domains. These results suggest that SVM is useful for facilitating the prediction of metal-binding proteins. Our software can be accessed at the SVMProt server http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
Year: 2006 PMID: 17254297 PMCID: PMC1764469 DOI: 10.1186/1471-2105-7-S5-S13
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Statistics of the datasets and prediction accuracy for each individual class of metal-binding proteins and for all metal-binding proteins. The predicted results are given as TP (true positive), FN (false negative), TN (true negative), FP (false positive), sensitivity SE = TP/(TP + FN) (accuracy for class members), specificity SP = TN/(TN + FP) (accuracy for non-members), and overall accuracy Q = (TN + TP)/(TP + FN + TN + FP). The number of members and non-members in the testing and independent evaluation sets is TP + FN and TN + FP, respectively.
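| Metal-binding protein class | Training set: positive | Training set: negative | Testing set: TP | Testing set: FN | Testing set: TN | Testing set: FP | Independent evaluation set: TP | Independent evaluation set: FN | Independent evaluation set: SE (%) | Independent evaluation set: TN | Independent evaluation set: FP | Independent evaluation set: SP (%) | Independent evaluation set: Q (%) |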
| Calcium-binding | 1816 | 4389 | 2055 | 245 | 8039 | 1906 | 1130 | 180 | 86.3% | 6373 | 851 | 88.2% | 87.9% |
| Cobalt-binding | 568 | 2151 | 456 | 2 | 13407 | 10 | 360 | 81 | 81.6% | 7809 | 7 | 99.9% | 98.9% |
| Copper-binding | 652 | 1999 | 417 | 109 | 13115 | 270 | 390 | 77 | 83.5% | 7587 | 146 | 98.1% | 97.3% |
| Iron-binding | 3128 | 3428 | 4869 | 290 | 3992 | 675 | 1104 | 71 | 94.0% | 6328 | 598 | 91.4% | 91.7% |
| Magnesium-binding | 2583 | 4023 | 2307 | 594 | 7267 | 115 | 3412 | 792 | 81.2% | 5848 | 805 | 87.9% | 85.3% |
| Manganese-binding | 1608 | 3099 | 1146 | 217 | 10841 | 1086 | 1061 | 182 | 85.4% | 7148 | 415 | 94.5% | 93.2% |
| Nickel-binding | 407 | 2001 | 95 | 2 | 13576 | 138 | 156 | 45 | 77.6% | 7816 | 65 | 99.2% | 98.6% |
| Potassium-binding | 408 | 1845 | 489 | 10 | 13789 | 8 | 301 | 32 | 90.4% | 7847 | 4 | 99.9% | 99.6% |
| Sodium-binding | 777 | 2010 | 338 | 1 | 13591 | 30 | 410 | 41 | 90.9% | 7831 | 11 | 99.9% | 99.4% |
| Zinc-binding | 2731 | 6416 | 6610 | 569 | 5931 | 360 | 4616 | 1546 | 74.9% | 6289 | 127 | 98.0% | 86.7% |
| All metal-binding | 5013 | 3101 | 11806 | 1015 | 4217 | 522 | 12070 | 3391 | 78.1% | 4529 | 617 | 88.0% | 80.6% |
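The SE, SP, and Q columns follow directly from the four counts defined in the caption. As a quick check, the sketch below recomputes them for the calcium-binding row of the independent evaluation set. The function name `confusion_metrics` is an illustrative choice and is not part of SVMProt.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (SE, SP, Q) as percentages from confusion-matrix counts."""
    se = 100.0 * tp / (tp + fn)                    # sensitivity: accuracy for class members
    sp = 100.0 * tn / (tn + fp)                    # specificity: accuracy for non-members
    q = 100.0 * (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    return se, sp, q


# Calcium-binding row, independent evaluation set (counts taken from the table).
se, sp, q = confusion_metrics(tp=1130, fn=180, tn=6373, fp=851)
print(f"SE={se:.1f}%  SP={sp:.1f}%  Q={q:.1f}%")  # SE=86.3%  SP=88.2%  Q=87.9%
```

Because Q pools members and non-members, it is dominated by the larger group. For cobalt-binding proteins, for example, Q is 98.9% although SE is only 81.6%, since the 7816 non-members far outnumber the 441 members.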
Prediction results of novel metal-binding proteins by SVMProt, where "+" represents proteins correctly predicted as metal-binding proteins, and "-" represents proteins incorrectly predicted as non-metal-binding proteins
| P04390 | - | P20910 | + | P43589 | - | Q09824 | - |
| O13826 | - | P22635 | - | P49412 | + | Q17374 | + |
| O13862 | + | P23382 | - | P49659 | + | Q44009 | + |
| O26638 | + | P23485 | + | P50534 | + | Q45488 | + |
| O29031 | + | P23657 | - | P52283 | - | Q52982 | - |
| O29156 | + | P23940 | + | P54355 | + | Q54450 | + |
| O29747 | - | P24005 | + | P54657 | + | Q56X52 | + |
| O42720 | - | P24059 | + | P56200 | - | Q59660 | - |
| O67672 | + | P24282 | - | P80479 | + | Q6F4C6 | + |
| O68557 | + | P26902 | + | P80509 | - | Q7Z2C4 | + |
| O75448 | + | P28875 | + | P81040 | + | Q80874 | + |
| O81916 | + | P31032 | + | P81242 | - | Q8GNT2 | + |
| P03697 | + | P31178 | + | P81605 | - | Q8VYR2 | + |
| P03825 | + | P32505 | + | P82604 | + | Q94702 | + |
| P12258 | + | P33353 | + | P83310 | - | Q95QY7 | - |
| P12608 | - | P33440 | + | Q00166 | + | Q9JJV3 | + |
| P14229 | + | P34806 | + | Q00167 | - | Q9LAI0 | - |
| P14633 | + | P39405 | - | Q00457 | + | Q9LIG0 | + |
| P19729 | + | P40379 | + | Q03471 | - | Q9VL31 | - |
| P19733 | + | P40685 | - | Q04580 | - | Q9WXE6 | + |
| P20050 | - | P40962 | + | Q06200 | + | Q9ZAA8 | + |
| P20193 | - | P40988 | + | Q08906 | + |
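The fraction of novel proteins recovered can be tallied directly from the "+" and "-" marks in the table. The sketch below assumes each row is a pipe-delimited list of alternating accession and mark cells, as laid out above. The helper name `tally_predictions` is hypothetical, and only the first three rows are used as sample input.

```python
def tally_predictions(rows):
    """Count '+' and '-' marks in pipe-delimited rows of (accession, mark) pairs."""
    plus = minus = 0
    for row in rows:
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        # Cells alternate: accession, mark, accession, mark, ...
        for mark in cells[1::2]:
            if mark == "+":
                plus += 1
            elif mark == "-":
                minus += 1
    return plus, minus


# First three rows of the table, copied verbatim.
sample_rows = [
    "| P04390 | - | P20910 | + | P43589 | - | Q09824 | - |",
    "| O13826 | - | P22635 | - | P49412 | + | Q17374 | + |",
    "| O13862 | + | P23382 | - | P49659 | + | Q44009 | + |",
]
plus, minus = tally_predictions(sample_rows)
print(plus, minus)  # 6 6
```

Applied to all 22 rows, the same tally gives 58 "+" marks out of 87 entries, or about 66.7%, which matches the 67% quoted in the abstract.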
Distribution of metal-binding proteins in different kingdoms and in top 10 host species of each kingdom. Not all protein sequences studied in this work are included because the host species information of some protein sequences is not yet available in the protein sequence database
| Viridae | Eukaryota | Bacteria | Archaea | |
| 576 | 17040 | 13692 | 1618 | |
| Bacteriophage T4 (12) | Homo sapiens (2218) | Escherichia coli (551) | Methanococcus jannaschii (203) | |
| Orgyia pseudotsugata multicapsid polyhedrosis virus (10) | Mus musculus (1850) | Escherichia coli O157:H7 (264) | Methanobacterium thermoautotrophicum (103) | |
| Autographa californica nuclear polyhedrosis virus (9) | Rattus norvegicus (1013) | Bacillus subtilis (250) | Methanosarcina acetivorans (93) | |
| Mimivirus (9) | Arabidopsis thaliana (882) | Salmonella typhimurium (229) | Archaeoglobus fulgidus (92) | |
| Variola virus (6) | Saccharomyces cerevisiae (528) | Escherichia coli O6 (212) | Methanosarcina mazei (91) | |
| Vaccinia virus (strain Copenhagen) (6) | Drosophila melanogaster (455) | Haemophilus influenzae (205) | Halobacterium salinarium (75) | |
| Vaccinia virus (strain Western Reserve/WR) (6) | Caenorhabditis elegans (388) | Shigella flexneri (197) | Pyrococcus horikoshii (72) | |
| Vaccinia virus (strain Ankara) (6) | Bos taurus (334) | Salmonella typhi (173) | Pyrococcus abyssi (71) | |
| Ictalurid herpesvirus 1 (5) | Schizosaccharomyces pombe (314) | Mycobacterium tuberculosis (164) | Pyrococcus furiosus (70) | |
| African swine fever virus (strain BA71V) (5) | Gallus gallus (252) | Synechocystis sp. (strain PCC 6803) (163) | Sulfolobus solfataricus (65) | |
Distribution of different classes of metal-binding proteins (calcium-binding, magnesium-binding, iron-binding, and zinc-binding) in different kingdoms and in top 10 host species. Not all protein sequences studied in this work are included because the host species information of some protein sequences is not yet available in the protein sequence database
| Calcium-binding: kingdom or species | No. of proteins | Magnesium-binding: kingdom or species | No. of proteins | Iron-binding: kingdom or species | No. of proteins | Zinc-binding: kingdom or species | No. of proteins | |
| Archaea | 73 | Archaea | 262 | Archaea | 381 | Archaea | 1048 | |
| Bacteria | 1092 | Bacteria | 2597 | Bacteria | 3743 | Bacteria | 6916 | |
| Eukaryota | 3897 | Eukaryota | 1081 | Eukaryota | 5248 | Eukaryota | 6464 | |
| Viridae | 343 | Viridae | 194 | Viridae | 29 | Viridae | 1466 | |
| Homo sapiens | 651 | Homo sapiens | 140 | Arabidopsis thaliana | 278 | Homo sapiens | 1121 | |
| Mus musculus | 499 | Mus musculus | 129 | Escherichia coli | 214 | Mus musculus | 911 | |
| Rattus norvegicus | 305 | Arabidopsis thaliana | 117 | Homo sapiens | 191 | Rattus norvegicus | 382 | |
| Arabidopsis thaliana | 186 | Rattus norvegicus | 69 | Mus musculus | 185 | Saccharomyces cerevisiae | 380 | |
| Bos taurus | 142 | Escherichia coli | 63 | Rattus norvegicus | 152 | Arabidopsis thaliana | 359 | |
| Gallus gallus | 103 | Saccharomyces cerevisiae | 55 | Drosophila melanogaster | 124 | Caenorhabditis elegans | 255 | |
| Drosophila melanogaster | 94 | Bacillus subtilis | 53 | Methanococcus jannaschii | 92 | Drosophila melanogaster | 237 | |
| Oryctolagus cuniculus | 82 | Escherichia coli O157:H7 | 45 | Saccharomyces cerevisiae | 88 | Schizosaccharomyces pombe | 221 | |
| Sus scrofa | 65 | Salmonella typhimurium | 44 | Escherichia coli O157:H7 | 87 | Escherichia coli | 172 | |
| Caenorhabditis elegans | 64 | Schizosaccharomyces pombe | 43 | Bacillus subtilis | 77 | Methanococcus jannaschii | 119 | |
Figure 1. The sequence of a hypothetical protein, used to illustrate the derivation of the feature vector of a protein. Sequence index indicates the position of an amino acid in the sequence. The index for each type of amino acid in the sequence (A or E) indicates the position of the first, second, third, ... occurrence of that type of amino acid (the first, second, third, ... A is at position 1, 3, 4, ...). A/E transition indicates the position of AE or EA pairs in the sequence.
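The indexing described in the caption can be sketched in a few lines. The sequence below is a made-up two-letter example, not the one drawn in Figure 1; it was chosen only so that the first three A residues fall at positions 1, 3, and 4, as in the caption. Recording a transition at the position of the first residue of the pair is an assumption of this sketch, and both function names are hypothetical.

```python
def residue_positions(seq, residue):
    """1-based positions of every occurrence of `residue` in `seq`."""
    return [i for i, aa in enumerate(seq, start=1) if aa == residue]


def transition_positions(seq, a, b):
    """1-based positions i where residues i and i+1 form an a-b or b-a pair."""
    return [
        i
        for i, (x, y) in enumerate(zip(seq, seq[1:]), start=1)
        if {x, y} == {a, b}
    ]


# Hypothetical sequence for illustration only (NOT the sequence in Figure 1).
seq = "AEAAEEAEEA"

print(residue_positions(seq, "A"))          # [1, 3, 4, 7, 10]
print(residue_positions(seq, "E"))          # [2, 5, 6, 8, 9]
print(transition_positions(seq, "A", "E"))  # [1, 2, 4, 6, 7, 9]
```

Position lists of this kind are the raw material from which per-type and transition descriptors of a sequence can be derived. How they are condensed into the final feature vector is not specified in the caption and is left out here.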