Adrian K Arakaki, Ying Huang, Jeffrey Skolnick.
Abstract
BACKGROUND: We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) and those that are heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated with one of two specific kinds of enzyme families. EFICAz exhibits high precision, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment.
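The MTTSI used above can be read as the highest sequence identity between a test sequence and any sequence in the training set. The sketch below illustrates that reading only. The function names and the toy identity measure (percent identical positions over pre-aligned, equal-length strings, with `-` as the gap character) are assumptions made for illustration, not the procedure used by EFICAz.

```python
from typing import Iterable


def percent_identity(a: str, b: str) -> float:
    """Toy identity for two pre-aligned sequences of equal length.

    Assumption: identity = identical positions / positions where neither
    sequence has a gap. Other conventions exist.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    identical = sum(1 for x, y in aligned if x == y)
    return 100.0 * identical / len(aligned)


def mttsi(test_seq: str, training_seqs: Iterable[str]) -> float:
    """Maximal test to training sequence identity, in percent."""
    identities = [percent_identity(test_seq, t) for t in training_seqs]
    if not identities:
        raise ValueError("training set is empty")
    return max(identities)


# Hypothetical aligned sequences, for illustration only.
value = mttsi("ACDE", ["ACDF", "WWWW"])
low_identity_regime = value < 30.0  # the regime the abstract singles out
```

In practice the pairwise identities would come from an alignment tool; only the final `max` step is specific to the MTTSI definition.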
Year: 2009 PMID: 19361344 PMCID: PMC2670841 DOI: 10.1186/1471-2105-10-107
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Prediction performance of the FDR-based and SVM-based approaches applied to Multiple Pfam enzyme families. For three-field (A, B) or four-field EC number classifiers (C, D), the average recall (A, C) and average precision (B, D) of the FDR-based (blue columns) and SVM-based (red columns) approaches are plotted at different intervals of maximal test to training sequence identity (MTTSI). Each performance indicator is averaged over all the EC numbers defined in the specified MTTSI interval (numbers at the bottom of each column). Details about the benchmark can be found in "Benchmarking of EFICAz2 version 10", in the Methods section. Statistically significant differences in performance are indicated by black lines under the corresponding columns (see "Statistical analyses", in the Methods section). Values on top of each column represent average ± standard deviation.
Figure 2. Prediction performance of the FDR-based and SVM-based approaches applied to CHIEFc enzyme families. For three-field (A, B) or four-field EC number classifiers (C, D), the average recall (A, C) and average precision (B, D) of the FDR-based (blue columns) and SVM-based (red columns) approaches are plotted at different intervals of maximal test to training sequence identity (MTTSI). Each performance indicator is averaged over all the EC numbers defined in the specified MTTSI interval (numbers at the bottom of each column). Details about the benchmark can be found in "Benchmarking of EFICAz2 version 10", in the Methods section. Statistically significant differences in performance are indicated by black lines under the corresponding columns (see "Statistical analyses", in the Methods section). Values on top of each column represent average ± standard deviation.
Figure 3. Prediction overlap of FDR-based and SVM-based methods. The fractions of test sequences (corresponding to the benchmark described in "Benchmarking of EFICAz2 version 10", in the Methods section) correctly predicted by three-field or four-field EC number classifiers applied to Multiple Pfam or CHIEFc enzyme families are represented. For each combination of enzyme family and level of description of the classifiers, we show the fraction corresponding to unique predictions made by the FDR-based (blue) or SVM-based method (green), and the fraction corresponding to predictions made by both (orange) or neither of the methods (yellow).
Figure 4. Prediction performance of different EFICAz implementations. For three-field (A, B) or four-field EC number classifiers (C, D), the average recall (A, C) and average precision (B, D) of the original EFICAz (green columns), EFICAz plus the new SVM-based components (blue columns) and EFICAz2 (red columns) are plotted at different intervals of maximal test to training sequence identity (MTTSI). Each performance indicator is averaged over all the EC numbers defined in the specified MTTSI interval (numbers at the bottom of each column). Details about the benchmark can be found in "Benchmarking of EFICAz2 version 10", in the Methods section. Statistically significant differences in performance are indicated by black lines under the corresponding columns (see "Statistical analyses", in the Methods section). Values on top of each column represent average ± standard deviation.
Figure 5. Predictive models for EFICAz. Classification trees for three-field (A, B) and four-field EC numbers (C, D) that integrate predictions from each of the six EFICAz2 components for protein sequences that exhibit MTTSI < 30% (A, C) or MTTSI ≥ 30% (B, D). CHFDR = CHIEFc family based FDR recognition; PFFDR = Multiple Pfam family based FDR recognition; CHSIT = CHIEFc family specific SIT evaluation; Prst = High specificity multiple PROSITE pattern recognition; CHsvm = CHIEFc family based SVM evaluation; PFsvm = Multiple Pfam family based SVM evaluation.
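The captions of Figures 1, 2 and 4 describe recall and precision computed per EC number and then averaged over all EC numbers in an MTTSI interval, reported as average and standard deviation. A minimal sketch of that aggregation follows. The record layout, the function name, the counts, and the use of the population standard deviation are assumptions for illustration; the source does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_mttsi(records):
    """Average per-EC-number precision and recall within each MTTSI interval.

    `records` is an iterable of (interval, ec_number, tp, fp, fn) tuples,
    a hypothetical layout chosen for this sketch.
    Returns {interval: {metric: (average, std_dev, n_ec_numbers)}}.
    """
    per_bin = defaultdict(lambda: {"precision": [], "recall": []})
    for interval, _ec, tp, fp, fn in records:
        if tp + fp > 0:  # precision is undefined when nothing was predicted
            per_bin[interval]["precision"].append(tp / (tp + fp))
        if tp + fn > 0:  # recall is undefined when there are no true members
            per_bin[interval]["recall"].append(tp / (tp + fn))

    summary = {}
    for interval, metrics in per_bin.items():
        summary[interval] = {
            name: (mean(values), pstdev(values), len(values))
            for name, values in metrics.items()
            if values
        }
    return summary


# Made-up counts, for illustration only.
example = [
    ("<30", "1.1.1", 8, 2, 2),
    ("<30", "2.7.11", 5, 5, 0),
    ("30-40", "3.4.21", 9, 1, 1),
]
result = summarize_by_mttsi(example)
```

Averaging per EC number rather than pooling all sequences gives each enzyme type equal weight, however many test sequences it has.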
Comparative enzyme function annotation of the human proteome(1)
| Level of detail of the enzyme function assignment: Three-field EC numbers | ||||||
| EFICAz2 predictions(2) | ||||||
| Annotation source | EC numbers with less than three fields(4): | Three-field EC numbers: 3,508/ | ||||
| EC numbers with less than three fields(4): | EFICAz2 novels: 798/ | |||||
| Level of EC annotation agreement(6) | ||||||
| KEGG annotations(3) | Annotation source | None | Partial | Full | ||
| Three-field EC numbers: 2,954/ | KEGG novels: 309/ | EFICAz2 | 18/ | 138/ | 2,554/ | |
| KEGG | 18/ | 73/ | ||||
| Level of detail of the enzyme function assignment: Four-field EC numbers | ||||||
| EFICAz2 predictions(2) | ||||||
| Annotation source | EC numbers with less than four fields(4): | Four-field EC numbers: 2,850/ | ||||
| EC numbers with less than four fields(4): | EFICAz2 novels: 522/ | |||||
| Level of EC annotation agreement(6) | ||||||
| KEGG annotations(3) | Annotation source | None | Partial | Full | ||
| Four-field EC numbers: 2,523/ | KEGG novels: 338/ | EFICAz2 | 49/ | 260/ | 2,019/ | |
| KEGG | 46/ | 120/ | ||||
(1) The source of the 24,305 human protein sequences is the KEGG Genes database Release 47.0+/06-26, of June 26, 2008.
(2) Predictions made by EFICAz2 version 13.
(3) Annotations obtained from the KEGG Brite database Release 47.0+/06-26, of June 26, 2008.
(4) Includes non-enzymes, considered as having zero-field EC numbers.
(5) Non-bolded font indicates number of annotations while bolded font refers to the number of annotated protein sequences (a single protein can display more than one enzymatic activity, thus, multiple EC numbers can be assigned to the same protein sequence).
(6) Here, we compare the agreement between annotations from KEGG and EFICAz2 that have the same level of detail, whether three-field or four-field EC numbers. Three different levels of agreement are considered: 1) Full: all EC numbers assigned to the protein by KEGG and EFICAz2 are identical, 2) Partial: at least one but not all the EC numbers assigned to the protein by KEGG and EFICAz2 agree, and 3) None: none of the EC numbers assigned to the protein by KEGG and EFICAz2 coincides.
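The three agreement levels defined in note (6) can be expressed as a comparison of two sets of EC numbers for one protein. The sketch below is an illustration under assumptions: the function name is hypothetical, both inputs are taken to be already at the same level of detail, and proteins lacking an annotation from either source are rejected because the note does not say how to score them.

```python
from typing import Set


def agreement_level(kegg: Set[str], eficaz2: Set[str]) -> str:
    """Classify KEGG vs. EFICAz2 agreement for one protein.

    Both sets must hold EC numbers at the same level of detail
    (all three-field or all four-field).
    """
    if not kegg or not eficaz2:
        # Assumption: the comparison only covers proteins annotated by both.
        raise ValueError("both sources must assign at least one EC number")
    if kegg == eficaz2:
        return "Full"      # all assigned EC numbers are identical
    if kegg & eficaz2:
        return "Partial"   # at least one, but not all, EC numbers agree
    return "None"          # no EC number in common


# Illustrative three-field EC numbers.
full = agreement_level({"3.4.21"}, {"3.4.21"})
partial = agreement_level({"3.4.21", "2.7.11"}, {"3.4.21"})
none = agreement_level({"1.1.1"}, {"2.7.11"})
```

Using sets makes the comparison independent of the order in which EC numbers are listed for a multifunctional protein.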
Number of sequences in reference sets used for EFICAz2 training
| Reference sequence set | EFICAz2 version 10 | EFICAz2 version 13 |
| "non enzymes" | 132,342 | 174,898 |
| "enzymes" (all) | 94,028 | 136,167 |
| "enzymes" (three-field EC number) | 90,801 | 131,503 |
| "enzymes" (four-field EC number) | 76,698 | 111,577 |
Number of families and EC number types associated with different EFICAz2 predictive components
| Type of EFICAz2 component | Three-field EC numbers | Four-field EC numbers | ||
| EFICAz2 version 10 | EFICAz2 version 13 | EFICAz2 version 10 | EFICAz2 version 13 | |
| PFAM families | 2294/ | 2294/ | 2022/ | 2153/ |
| CHIEFc families | 2932/ | 2947/ | 3548/ | 3607/ |
| PROSITE patterns | 807/ | 1949/ | 527/ | 1368/ |
| All EFICAz2 components | ||||
(1) Non-bolded font indicates number of families or patterns while bolded font refers to the number of different EC number types recognized by the indicated category of EFICAz2 predictive component.
Figure 6. Distribution of the number of test sequences per enzyme type. Distribution of 9,397 test enzyme sequences into 145 types of three-field EC numbers (green columns) and 6,996 test enzyme sequences into 614 types of four-field EC numbers (red columns).