Shih-Hau Chiu, Chien-Chi Chen, Gwo-Fang Yuan, Thy-Hou Lin.
Abstract
BACKGROUND: The number of sequences compiled in genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to narrow the gap between the volume of new sequences produced and the availability of reliable functional annotation. This work proposes rules for automatically classifying fungal genes. The approach involves elucidating the enzyme classification rules hidden in the UniProt protein knowledgebase and then applying them for classification. The association algorithm Apriori is used to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity.
Year: 2006 PMID: 16776838 PMCID: PMC1552092 DOI: 10.1186/1471-2105-7-304
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
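The mining step described in the abstract can be illustrated with a minimal, pure-Python sketch of Apriori-style rule generation. This is not the Weka implementation used in the paper: the function name `mine_rules`, the thresholds, and the toy records (including the made-up class `0.0.0.0`) are assumptions chosen for illustration, and the candidate pruning is simplified.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, min_support=2, min_confidence=0.8, max_size=3):
    """Mine rules {InterPro entries} -> EC class from (set_of_iprs, ec) records.

    Returns a list of (antecedent_tuple, ec, support_count, confidence).
    Simplified Apriori: candidates grow level by level from surviving itemsets.
    """
    rules = []
    # Level 1 candidates: every single InterPro accession seen in the data.
    candidates = {frozenset([ipr]) for iprs, _ in records for ipr in iprs}
    for size in range(1, max_size + 1):
        antecedent_count = Counter()
        joint_count = Counter()
        for iprs, ec in records:
            for cand in candidates:
                if cand <= iprs:  # all entries of the candidate are present
                    antecedent_count[cand] += 1
                    joint_count[(cand, ec)] += 1
        # Keep only itemsets that occur often enough.
        survivors = {c for c, n in antecedent_count.items() if n >= min_support}
        for (cand, ec), n in joint_count.items():
            if cand in survivors and n >= min_support:
                confidence = n / antecedent_count[cand]
                if confidence >= min_confidence:
                    rules.append((tuple(sorted(cand)), ec, n, confidence))
        # Join step: combine surviving itemsets into candidates one item larger.
        candidates = {a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size + 1}
    return rules


# Toy records (illustrative only, not taken from the paper's datasets).
toy_records = [
    ({"IPR000850", "IPR007862"}, "2.7.4.3"),
    ({"IPR000850", "IPR007862"}, "2.7.4.3"),
    ({"IPR007862"}, "2.7.4.3"),
    ({"IPR000850"}, "0.0.0.0"),  # made-up class to lower one rule's confidence
    ({"IPR002314", "IPR002317"}, "6.1.1.11"),
    ({"IPR002314", "IPR002317"}, "6.1.1.11"),
]
toy_rules = mine_rules(toy_records)
for antecedent, ec, support, confidence in toy_rules:
    print(",".join(antecedent), "->", ec, support, round(confidence, 2))
```

In this toy run, `IPR000850` alone does not yield a rule because only two of its three occurrences map to the same class, whereas the pair `IPR000850,IPR007862` does.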
Five taxonomically distinct datasets, defined with reference to NEWT, were used for generating and evaluating rules.
| | A | B | C | D | E |
| Training Instances | 3251 | 7522 | 3666 | 1791 | 4502 |
| Training Attributes | 657 | 784 | 823 | 589 | 551 |
| Testing Instances | 3440 | 10226 | 1759 | 1551 | 5022 |
| Testing Attributes | 777 | 1054 | 491 | 212 | 507 |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
Figure 1. Input file to the Weka program. Each false attribute was replaced with a "?" mark, denoting a missing datum, to prevent the generation of useless association rules.
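The encoding described in the Figure 1 caption can be sketched as a small ARFF writer. The exact layout of the authors' input file is not shown here, so the attribute declarations, the single-valued `{t}` domain, and the helper name `to_arff` are assumptions; only the use of "?" for absent attributes comes from the caption.

```python
def to_arff(relation, attributes, records):
    """Render records as a dense ARFF file for association mining.

    attributes: ordered list of attribute names (InterPro entries, EC labels).
    records: list of sets, each holding the attributes present in one protein.
    Absent attributes are written as "?" (missing) rather than a false value,
    so a miner cannot build rules out of shared absences.
    """
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        # Assumed declaration: a nominal attribute with the single value "t".
        lines.append(f"@attribute {name} {{t}}")
    lines += ["", "@data"]
    for present in records:
        lines.append(",".join("t" if a in present else "?" for a in attributes))
    return "\n".join(lines)


# Toy example with one protein carrying one InterPro entry and one EC label.
toy_arff = to_arff(
    "toy_fungi",
    ["IPR000850", "IPR007862", "EC_2.7.4.3"],
    [{"IPR000850", "EC_2.7.4.3"}],
)
print(toy_arff)
```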
Subset of rules generated from the fungus training dataset. The complete set of rules is given in Additional file 1.
| Association rules | EC ID |
| IPR000873,IPR001031,IPR001242,IPR006163 | 6.3.2.26 |
| IPR001031,IPR001242,IPR006163 | 6.3.2.26 |
| IPR000873,IPR001031,IPR006163 | 6.3.2.26 |
| IPR000873,IPR001031,IPR001242 | 6.3.2.26 |
| IPR002314,IPR002317 | 6.1.1.11 |
| IPR001926,IPR002028 | 4.2.1.20 |
| IPR001031,IPR001242 | 6.3.2.26 |
| IPR000873,IPR001031 | 6.3.2.26 |
| IPR000850,IPR007862 | 2.7.4.3 |
| IPR007862 | 2.7.4.3 |
| IPR004308 | 6.3.2.2 |
| IPR003171 | 1.5.1.20 |
| IPR002934 | 2.7.7.19 |
| IPR000873,IPR001031,IPR001242,IPR006164 | 6.3.2.26 |
| IPR001031,IPR001242,IPR006164 | 6.3.2.26 |
| IPR000873,IPR001031,IPR003679 | 6.3.2.26 |
| IPR000873,IPR001031,IPR008600 | 6.3.2.26 |
| IPR002314,IPR002318 | 6.1.1.12 |
| IPR001926,IPR002029 | 4.2.1.21 |
| IPR001031,IPR001243 | 6.3.2.26 |
| IPR000873,IPR001032 | 6.3.2.26 |
| IPR000850,IPR007863 | 2.7.4.3 |
| IPR004308 | 6.3.2.3 |
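Rules of the form shown in the table can be applied to a protein by checking whether all InterPro entries of a rule occur among the protein's entries. The following is a minimal sketch using a few rows of the table; the function name `predict_ec` and the tie-breaking choice (prefer the rule with the most entries) are assumptions, since the source does not state how competing rules are resolved.

```python
# A few rules copied from the table above: (InterPro entries, EC ID).
RULES = [
    (frozenset({"IPR000873", "IPR001031", "IPR001242", "IPR006163"}), "6.3.2.26"),
    (frozenset({"IPR002314", "IPR002317"}), "6.1.1.11"),
    (frozenset({"IPR000850", "IPR007862"}), "2.7.4.3"),
    (frozenset({"IPR007862"}), "2.7.4.3"),
    (frozenset({"IPR003171"}), "1.5.1.20"),
]


def predict_ec(protein_iprs, rules=RULES):
    """Return the EC ID of the most specific matching rule, or None."""
    matches = [(ante, ec) for ante, ec in rules if ante <= protein_iprs]
    if not matches:
        return None
    # Assumed tie-break: the rule with the largest antecedent wins.
    _, ec = max(matches, key=lambda m: len(m[0]))
    return ec


# "IPR999999" is a hypothetical accession that matches no rule.
print(predict_ec({"IPR000850", "IPR007862", "IPR999999"}))
print(predict_ec({"IPR999999"}))
```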
Number of rules and classified EC classes generated from each training dataset.
| | A | B | C | D | E |
| Rules | 624 | 607 | 920 | 1096 | 428 |
| EC | 254 | 229 | 167 | 168 | 153 |
| multiple domain rule | 40% | 43% | 69% | 72% | 42% |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
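As a rough reading aid, the approximate number of multiple domain rules per dataset can be recovered from the rule counts and percentages in the table. Because the percentages are rounded, the counts below are estimates rather than values reported by the source.

```python
# Values copied from the table above.
RULE_COUNTS = {"A": 624, "B": 607, "C": 920, "D": 1096, "E": 428}
MULTI_DOMAIN_PCT = {"A": 40, "B": 43, "C": 69, "D": 72, "E": 42}

# Estimated count of multiple domain rules (approximate: inputs are rounded).
multi_counts = {
    key: round(RULE_COUNTS[key] * MULTI_DOMAIN_PCT[key] / 100)
    for key in RULE_COUNTS
}
print(multi_counts)
```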
Evaluation of the generated candidate rules. The testing dataset was used to validate the corresponding set of rules. For instance, the fungus testing dataset was used to evaluate the set of rules generated from the fungus training dataset.
| | A | B | C | D | E |
| precision | 71% | 76% | 87% | 88% | 77% |
| confidence | 69% | 74% | 85% | 85% | 75% |
| coverage* | 43% | 38% | 60% | 54% | 56% |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
*: coverage = the hit ratio of testing data
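A minimal sketch of how such an evaluation might be computed is given below. Coverage follows the footnote (the fraction of testing entries hit by at least one rule). The precision definition used here (among hit entries, the fraction whose annotated EC is among the predicted ones) is an assumption, and the "confidence" row of the table is not reproduced because the source does not define it in this excerpt.

```python
def evaluate(rules, test_set):
    """Return (precision, coverage) of a rule set on a testing dataset.

    rules: list of (frozenset_of_iprs, ec).
    test_set: list of (set_of_iprs, annotated_ec).
    """
    hits = 0
    correct = 0
    for iprs, true_ec in test_set:
        predicted = {ec for ante, ec in rules if ante <= iprs}
        if predicted:
            hits += 1  # entry is covered by at least one rule
            if true_ec in predicted:
                correct += 1
    coverage = hits / len(test_set) if test_set else 0.0
    precision = correct / hits if hits else 0.0
    return precision, coverage


# Toy rules and testing entries (illustrative only).
toy_eval_rules = [
    (frozenset({"IPR007862"}), "2.7.4.3"),
    (frozenset({"IPR002314", "IPR002317"}), "6.1.1.11"),
]
toy_test_set = [
    ({"IPR007862"}, "2.7.4.3"),                # hit, correct
    ({"IPR002314", "IPR002317"}, "6.1.1.11"),  # hit, correct
    ({"IPR007862", "IPR000850"}, "0.0.0.0"),   # hit, wrong (made-up class)
    ({"IPR999999"}, "1.5.1.20"),               # no rule matches
]
toy_precision, toy_coverage = evaluate(toy_eval_rules, toy_test_set)
print(toy_precision, toy_coverage)
```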
Cross validation of the rule sets. Each testing dataset was used to evaluate the rule sets generated from the A, B, C, D, and E training datasets; for instance, the fungus testing dataset (row C) was evaluated against all five rule sets.
| | A rule set | | B rule set | | C rule set | | D rule set | | E rule set | |
| | precision | confidence | precision | confidence | precision | confidence | precision | confidence | precision | confidence |
| A testing data | 71% | 69% | 72% | 69% | 59% | 56% | 46% | 42% | 52% | 48% |
| B testing data | 68% | 67% | 76% | 74% | 61% | 59% | 43% | 41% | 51% | 49% |
| C testing data | 72% | 68% | 69% | 65% | 87% | 85% | 79% | 76% | 66% | 62% |
| D testing data | 66% | 61% | 43% | 38% | 74% | 72% | 88% | 85% | 68% | 65% |
| E testing data | 52% | 50% | 65% | 62% | 60% | 58% | 64% | 62% | 77% | 75% |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
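The cross-validation table can be read off programmatically to see which rule set gives the highest precision on each testing dataset. The snippet below only restates the precision values from the table; it is a reading aid, not part of the authors' analysis.

```python
# Precision (%) from the table: PRECISION[testing dataset][rule set].
PRECISION = {
    "A": {"A": 71, "B": 72, "C": 59, "D": 46, "E": 52},
    "B": {"A": 68, "B": 76, "C": 61, "D": 43, "E": 51},
    "C": {"A": 72, "B": 69, "C": 87, "D": 79, "E": 66},
    "D": {"A": 66, "B": 43, "C": 74, "D": 88, "E": 68},
    "E": {"A": 52, "B": 65, "C": 60, "D": 64, "E": 77},
}

# Rule set with the highest precision for each testing dataset.
best_rule_set = {test: max(row, key=row.get) for test, row in PRECISION.items()}
print(best_rule_set)
```

For four of the five testing datasets the matching rule set scores highest; for the A testing data the B rule set is one percentage point ahead (72% versus 71%).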
The five datasets used to evaluate the rules parsed from the InterPro database.
| | the parsed rules# | | |
| | precision | confidence | coverage* |
| A testing data | 52% | 48% | 23% |
| B testing data | 51% | 50% | 24% |
| C testing data | 56% | 52% | 26% |
| D testing data | 47% | 41% | 20% |
| E testing data | 64% | 62% | 47% |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
*: coverage = the hit ratio of testing data
#: The rules were parsed from the entry xref table of the InterPro database, using the IPR accessions cross-referenced to ENZYME.
Accuracy of the single-domain rules separated out from the five rule sets.
| | A | B | C | D | E |
| precision | 68% | 71% | 87% | 85% | 77% |
| confidence | 65% | 70% | 85% | 82% | 75% |
| coverage* | 31% | 28% | 48% | 41% | 46% |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
*: coverage = the hit ratio of testing data
Accuracy of the multiple-domain rules separated out from the five rule sets.
| | A | B | C | D | E |
| precision | 79% | 87% | 87% | 97% | 76% |
| confidence | 75% | 85% | 82% | 93% | 72% |
| coverage* | 12% | 10% | 11% | 13% | 10% |
A: actinobacteria B: bacillales C: fungi D: nematode + arthropoda E: viridiplantae
*: coverage = the hit ratio of testing data
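A quick consistency check relates the single domain and multiple domain coverages to the overall coverage reported in the evaluation of the candidate rules. The snippet assumes, without confirmation from the source, that each hit testing entry is counted under exactly one of the two rule types, so that the two coverages should add up to the overall value.

```python
# Coverage (%) copied from the three evaluation tables.
OVERALL = {"A": 43, "B": 38, "C": 60, "D": 54, "E": 56}
SINGLE = {"A": 31, "B": 28, "C": 48, "D": 41, "E": 46}
MULTIPLE = {"A": 12, "B": 10, "C": 11, "D": 13, "E": 10}

# Difference between overall coverage and the sum of the two parts.
coverage_gap = {k: OVERALL[k] - (SINGLE[k] + MULTIPLE[k]) for k in OVERALL}
print(coverage_gap)
```

The parts add up exactly for A, B, D, and E; the one-point gap for C is compatible with rounding.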
Examples of matching entries in the remaining fungus dataset of Swiss-Prot entries that were not annotated with an EC class. Their EC classes were predicted using the fungus rule set.
| Swiss-Prot ID | Description | Predicted EC | Lift score |
| P38811 | Transcription-associated protein 1 (p400 kDa component of SAGA). | 2.7.1.137 | 523.71 |
| P23202 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q00717 | Putative sterigmatocystin biosynthesis protein stcT. | 2.5.1.18 | 523.71 |
| Q6BM74 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q7LLZ8 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q8NJR4 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q8NJR5 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q96WL3 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q96X43 | URE2 protein. | 2.5.1.18 | 523.71 |
| Q96X44 | URE2 protein. | 2.5.1.18 | 523.71 |
| P43589 | Hypothetical 52.2 kDa protein in MPR1-GCN20 intergenic region. | 3.1.2.15 | 122.2 |
| O42908 | Hypothetical protein C119.17 in chromosome II. | 3.4.24.64 | 192.95 |
| Q10068 | Hypothetical protein C3H1.02c in chromosome I. | 3.4.24.64 | 192.95 |
| Q12496 | Hypothetical 118.4 kDa protein in WRS1-PKH2 intergenic region. | 3.4.24.64 | 192.95 |
| P39994 | Hypothetical 61.3 kDa protein in URA3-MMS21 intergenic region. | 4.1.1.1 | 261.86 |
| P43546 | Hypothetical 16.6 kDa protein in THI5-AGP3 intergenic region. | 1.1.1.- | 112.42 |
| P38169 | Hypothetical 52.4 kDa protein in ATP1-ROX3 intergenic region precursor. | 1.14.99.7 | 523.71 |
| P10662 | Mitochondrial 40S ribosomal protein MRP1. | 1.15.1.1 | 107.82 |
| P47141 | Hypothetical 30.2 kDa protein in YUH1-URA8 intergenic region. | 1.15.1.1 | 107.82 |
| P53109 | Hypothetical 65.8 kDa protein in SUT1-RCK1 intergenic region. | 1.16.1.7 | 407.33 |
| P36168 | Hypothetical 137.5 kDa protein in MPL1-PPC1 intergenic region. | 1.2.1.3 | 174.57 |
| P38992 | SUR2 protein (Syringomycin response protein 2). | 1.3.3.- | 305.5 |
| P36168 | Hypothetical 137.5 kDa protein in MPL1-PPC1 intergenic region. | 1.5.1.12 | 174.57 |
| P40215 | Hypothetical 62.8 kDa protein in RPS16A-TIF34 intergenic region. | 1.8.1.9 | 111.09 |
| P52923 | Hypothetical 41.3 kDa protein in HXT17-COS10 intergenic region. | 1.8.1.9 | 111.09 |
| P14908 | Mitochondrial replication protein MTF1 (Mitochondrial transcription factor mtTFB) (RF1023) (Mitochondrial specificity factor). | 2.1.1.- | 203.67 |
| P87250 | Mitochondrial replication protein MTF1 (Mitochondrial transcription factor MTTFB). | 2.1.1.- | 203.67 |
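The lift score in the last column can be illustrated with the standard association-rule definition, lift = confidence(X -> Y) / P(Y). The sketch below uses that textbook formula; the function name `lift` and the example counts are assumptions. As an inference from the numbers rather than a statement in the source, the value 523.71 equals 3666 / 7, which is what a rule with confidence 1 would score if its EC class occurred in 7 of the 3666 fungus training instances.

```python
def lift(n_total, n_antecedent, n_consequent, n_both):
    """Lift of the rule X -> Y computed from raw counts.

    n_total: number of training instances.
    n_antecedent: instances containing all InterPro entries of X.
    n_consequent: instances annotated with EC class Y.
    n_both: instances containing X and annotated with Y.
    """
    confidence = n_both / n_antecedent
    consequent_rate = n_consequent / n_total
    return confidence / consequent_rate


# Illustrative counts (assumed): a confidence-1 rule for a class seen 7 times
# among 3666 training instances.
example_lift = lift(n_total=3666, n_antecedent=7, n_consequent=7, n_both=7)
print(round(example_lift, 2))
```

A lift far above 1 indicates that the InterPro entries and the EC class co-occur much more often than expected if they were independent.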