Davide Baldazzi, Castrense Savojardo, Pier Luigi Martelli, Rita Casadio.
Abstract
The Bologna ENZyme Web Server (BENZ WS) annotates four-level Enzyme Commission numbers (EC numbers) as defined by the International Union of Biochemistry and Molecular Biology (IUBMB). BENZ WS filters a target sequence with a combined system of Hidden Markov Models (HMMs), which model protein sequences annotated with the same molecular function, and Pfams, which describe conserved protein domains. When successful, BENZ returns an associated four-level EC number for any enzyme target sequence. Our system can annotate both monofunctional and polyfunctional enzymes, and it can be a valuable resource for sequence functional annotation.
Year: 2021 PMID: 33963861 PMCID: PMC8262719 DOI: 10.1093/nar/gkab328
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Workflow of BENZ WS. For a query sequence in FASTA format, the annotation procedure starts with HMM filtering. If the retaining HMM is associated with more than one reference sequence (blue star), a dendrogram is generated to find, among the reference sequences, the one most similar to the target. Otherwise (yellow star), the target is associated with the only reference. The EC number is then assigned to the query sequence after evaluating whether the reference protein architecture (Ref Seq Arch) is contained (⊆) in the predicted Pfam architecture of the target (Query Pred Arch), focusing on Pfams carrying relevant sites. Pfams in our system are annotated, when possible, with the positions of the active site, ligand binding site and metal binding site (relevant sites). A sequence feature viewer allows the user to verify whether the query sequence conserves the residues relevant to catalysis, in order to validate the transfer of annotation from the reference sequence. Links to the reference sequence UniProt/SwissProt file, structure PDB file and Pfam entries, together with KEGG identifiers and pathways, are also present in the output (see HELP, https://benzdb.biocomp.unibo.it/help).
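The decision logic in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BENZ WS implementation: the `Reference` class, its field names, the `transfer_annotation` function and the `similarity` callable (a stand-in for the dendrogram step) are all hypothetical, and HMM filtering and Pfam prediction are assumed to have been done beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """A reference sequence; all field names here are illustrative."""
    accession: str
    ec_numbers: tuple        # one entry if monofunctional, several if polyfunctional
    architecture: frozenset  # Pfam IDs carrying relevant sites


def transfer_annotation(references, query_architecture, similarity=None):
    """Return the EC numbers transferred to the query, or None.

    references: reference sequences linked to the HMM that retained the query.
    query_architecture: set of Pfam IDs predicted on the query.
    similarity: callable scoring a reference against the query; only needed
        when the HMM is linked to more than one reference.
    """
    if not references:
        return None  # no HMM retained the query
    if len(references) == 1:
        best = references[0]
    else:
        # Stand-in for the dendrogram step: pick the most similar reference.
        best = max(references, key=similarity)
    # Transfer only if the reference architecture is contained in the query's.
    if best.architecture <= frozenset(query_architecture):
        return best.ec_numbers
    return None
```

Returning `None` both when no HMM retains the query and when the containment test fails keeps the sketch short; a real pipeline would probably report the two cases separately.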
BENZ statistics
| | EC 1 | EC 2 | EC 3 | EC 4 | EC 5 | EC 6 | EC 7 | Total |
|---|---|---|---|---|---|---|---|---|
| EC numbersa | 1437 | 1550 | 1034 | 595 | 254 | 189 | 77 | 5136 |
| Cluster HMM | 1758 | 5116 | 3755 | 1006 | 637 | 636 | 288 | 12 612 |
| Cluster HMM GOLD | 1326 | 4315 | 3190 | 800 | 497 | 496 | 218 | 10 547 |
| Cluster HMM BLUE | 432 | 801 | 565 | 206 | 140 | 140 | 70 | 2065 |
| Ref Seqb | 2752 (390) | 6455 (990) | 4582 (729) | 1348 (324) | 842 (145) | 883 (105) | 405 (14) | 16 593 (2023) |
| Ref Strb | 1230 (152) | 2252 (253) | 2100 (213) | 625 (110) | 333 (41) | 287 (22) | 149 (5) | 6798 (618) |
| Pfamc | 682 (429) | 1923 (769) | 1672 (711) | 463 (321) | 294 (190) | 276 (133) | 143 (50) | 4158 (1758) |
| KEGG IDd | 2390 | 5908 | 3770 | 1185 | 799 | 894 | 343 | 14 745 |
| KEGG Pathwayd | 2758 | 4812 | 2628 | 1972 | 952 | 1266 | 317 | 9382 |
| Organismse | Arc 15, Bac 158, Euk 200, Vir 13 | Arc 20, Bac 187, Euk 232, Unk 1, Vir 193 | Arc 15, Bac 165, Euk 208, Unk 1, Vir 135 | Arc 13, Bac 125, Euk 141, Vir 4 | Arc 16, Bac 109, Euk 83, Vir 6 | Arc 18, Bac 119, Euk 53, Vir 12 | Arc 7, Bac 57, Euk 47 | Arc 24, Bac 261, Euk 391, Unk 2, Vir 213 |
aFour-level EC numbers are distributed according to the 7 EC classes: EC1-Oxidoreductases, EC2-Transferases, EC3-Hydrolases, EC4-Lyases, EC5-Isomerases, EC6-Ligases, EC7-Translocases.
bRef Seq and Ref Str: number of reference sequences and of reference sequences with a structure, respectively; the number of polyfunctional enzymes is given in brackets.
cPfam: models from Pfam (https://pfam.xfam.org); within brackets Pfams, where relevant sites (active, metal, ligand binding site) are annotated.
dKEGG ID: from UniProt annotation; KEGG pathway: from https://www.genome.jp/kegg/.
eNumber of organisms detailed for each kingdom. Arc: Archaea; Bac: Bacteria; Euk: Eukaryota; Oth: Others; Vir: Viruses; Unk: unknown. Annotation source: UniProt. Grand Total: 891.
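A few of the figures in the table can be cross-checked with simple arithmetic. The short script below is a reader-side sanity check, not part of BENZ WS; the variable names are illustrative, and the values are copied from the table (columns EC 1 to EC 7, then Total). Only relations that hold in the table are checked: the per-class Cluster HMM counts, for instance, add up to more than the value in the Total column, so that row is not summed.

```python
# Values copied from the BENZ statistics table (columns EC 1..EC 7, then Total).
cluster_hmm = [1758, 5116, 3755, 1006, 637, 636, 288, 12612]
gold = [1326, 4315, 3190, 800, 497, 496, 218, 10547]
blue = [432, 801, 565, 206, 140, 140, 70, 2065]
ec_numbers = [1437, 1550, 1034, 595, 254, 189, 77, 5136]
organisms_total = {"Arc": 24, "Bac": 261, "Euk": 391, "Unk": 2, "Vir": 213}

# GOLD and BLUE clusters should partition the Cluster HMM count in every column.
gold_plus_blue_ok = all(g + b == c for g, b, c in zip(gold, blue, cluster_hmm))

# The per-class EC number counts should add up to the Total column.
ec_sum_ok = sum(ec_numbers[:-1]) == ec_numbers[-1]

# The kingdom counts in the Total column should give the Grand Total of footnote e.
organism_grand_total = sum(organisms_total.values())

print(gold_plus_blue_ok, ec_sum_ok, organism_grand_total)
```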
BENZ at work
| Dataset | Sequences (#) | Acce (%) | FNRf (%) | FPRg (%) |
|---|---|---|---|---|
| Positivea | 197 880 | 92.4 | 3.9 | - |
| Negativeb | 12 315 | 95.1 | - | 4.9 |
| Polyfunctionalc | 10 764 | 93.7 | 5.0 | - |
| TrEMBL-humand | 10 024 | 93.4 | 5.6 | - |
aPositive: the positive set contains complete SwissProt sequences without any PDB counterpart and annotated only with a four-level EC number.
bNegative: the negative set comprises complete SwissProt sequences with a PDB counterpart and without EC codes.
cPolyfunctional: the set includes complete SwissProt sequences that are annotated with two or more four-level EC numbers.
dTrEMBL-human: the set contains complete TrEMBL sequences from Homo sapiens annotated with a four-level EC number.
eAcc (Accuracy) measures the percentage of proteins correctly assigned. For sets containing positive examples, it corresponds to the True Positive Rate evaluated at the level of four-level EC annotation. For the negative set, it corresponds to the True Negative Rate.
fFNR (False Negative Rate) measures the percentage of enzymes predicted as non-enzymes.
gFPR (False Positive Rate) measures the percentage of non-enzymes predicted as enzymes.
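The three measures defined in footnotes e to g can be computed as in the sketch below. It is a minimal illustration, assuming one EC number per protein and exact string equality as the criterion for a correct four-level assignment; the function names and the use of `None` for a non-enzyme prediction are conventions chosen here, not taken from BENZ WS.

```python
def positive_set_rates(true_ecs, predicted_ecs):
    """Acc and FNR (in percent) for a set made only of enzymes.

    predicted_ecs holds a four-level EC string, or None when the query is
    predicted to be a non-enzyme.
    """
    n = len(true_ecs)
    correct = sum(p == t for t, p in zip(true_ecs, predicted_ecs))
    missed = sum(p is None for p in predicted_ecs)
    return 100.0 * correct / n, 100.0 * missed / n


def negative_set_rates(predicted_ecs):
    """Acc (True Negative Rate) and FPR (in percent) for a set of non-enzymes."""
    n = len(predicted_ecs)
    false_pos = sum(p is not None for p in predicted_ecs)
    return 100.0 * (n - false_pos) / n, 100.0 * false_pos / n
```

Under these definitions Acc and FPR on the negative set always sum to 100%, as they do in the table (95.1 + 4.9). On the positive sets Acc and FNR need not sum to 100%, since an enzyme can also be predicted as an enzyme with a different EC number.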
BENZ benchmarking
| Method | Data seta | TPRf (%) 1 level | TPRf (%) 2 level | TPRf (%) 3 level | TPRf (%) 4 level | FNRg (%) | FPRh (%) |
|---|---|---|---|---|---|---|---|
| BENZ WSb | Full | 87.5 | 87.5 | 87.5 | 85.0 | 12.2 | 3.0 |
| BENZ WSb | Reduced | 79.2 | 79.2 | 79.2 | 75.1 | 20.2 | 3.0 |
| ECPredc | Reduced | 43.7 | 34.7 | 23.8 | 13.1 | 45.6 | 12.2 |
| DEEPred | Reduced | 38.8 | 35.2 | 27.9 | 20.8 | 51.1 | 2.4 |
| EFICAz2.5.1e | Reduced | 33.6 | 33.1 | 31.1 | 16.7 | 63.7 | 1.6 |
aBenchmark datasets are extracted by comparing SwissProt releases 2020_3 and 2019_11. The full dataset includes 607 proteins that have gained EC annotation (7 EC classes); the reduced dataset includes a subset of 366 enzyme sequences without EC codes of the seventh class for comparing with the other predictors. Both datasets comprise 1013 non-enzyme sequences as negative examples.
bA BENZ WS version including only sequences and annotations available in the SwissProt release 2019_11 has been used for this test.
cECPred (15) has been downloaded from https://github.com/cansyl/ECPred and run in-house; it does not provide multiclass predictions and the best match between the output and the list of EC numbers has been considered for multiclass enzymes. It does not include enzymes of EC class 7.
dDEEPre (16) predictions have been run on the webserver http://www.cbrc.kaust.edu.sa/DEEPre/ in modality ‘I’m not sure the sequence is an enzyme’; it does not provide multiclass predictions and the best match between the output and the list of EC numbers has been considered for multiclass enzymes. It does not include enzymes of the EC class 7.
eEFICAz2.5.1 (25) has been downloaded from https://sites.gatech.edu/cssb/eficaz2-5/ and run in-house; it does not include enzymes of EC class 7.
fTPR (True Positive Rate) measures the percentage of enzymes assigned to the correct EC class. TPRs have been evaluated at each of the four levels of EC annotation.
gFNR (False Negative Rate) measures the percentage of enzymes predicted as non-enzymes.
hFPR (False Positive Rate) measures the percentage of non-enzymes predicted as enzymes.
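Evaluating the TPR at the first, second, third and fourth EC level amounts to comparing progressively longer prefixes of the EC number. The sketch below shows one way to do this, assuming a single true EC number per enzyme and `None` for a query predicted as a non-enzyme; the function names are illustrative and this is not the evaluation code used for the benchmark.

```python
def ec_matches(true_ec, predicted_ec, level):
    """True when the first `level` fields of two EC numbers agree."""
    return true_ec.split(".")[:level] == predicted_ec.split(".")[:level]


def tpr_by_level(true_ecs, predicted_ecs):
    """TPR (in percent) at EC levels 1 to 4; None marks a non-enzyme call."""
    n = len(true_ecs)
    result = {}
    for level in range(1, 5):
        hits = sum(
            p is not None and ec_matches(t, p, level)
            for t, p in zip(true_ecs, predicted_ecs)
        )
        result[level] = 100.0 * hits / n
    return result
```

Because a match at a given level implies a match at every shallower level, the TPR computed this way can only stay constant or decrease from level 1 to level 4, which is the pattern seen in every row of the table.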