Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class.

Elzbieta Rembeza¹, Martin K M Engqvist¹.

Abstract

Only a small fraction of genes deposited to databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown; experimental attempts to validate the accuracy within individual enzyme classes are lacking. In this study we performed an overview of functional annotations to the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations to an enzyme class of S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database, and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing no similarity or domain architecture to experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.

Year: 2021; PMID: 34555022; PMCID: PMC8491902; DOI: 10.1371/journal.pcbi.1009446

Source DB: PubMed; Journal: PLoS Comput Biol; ISSN: 1553-734X; Impact factor: 4.475

Introduction

With the steady increase of genetic information deposited to public databases, the proportion of experimentally characterised sequences continues to decline. At the time of writing the UniProt/TrEMBL protein database contains nearly 185 million entries, with only 0.3% of them having been manually annotated and reviewed in the Swiss-Prot database [1]. Furthermore, the experimentally characterised sequence diversity is limited, representing proteins mainly from eukaryotes and model organisms. As the traditional experimental methods for determining protein function cannot keep up with the increase in genomic data, high-throughput methods enabling protein family-wide substrate profiling for hundreds of enzymes are being implemented. Data generated in such approaches are important for understanding sequence-function relationships in the tested protein families; they have led to the discovery of novel enzymatic activities as well as identified enzymes with diverse physicochemical properties [2-6]. Additionally, several global initiatives have been undertaken to bring together computational and experimental scientists to accelerate discovery of novel protein activities and enable more trustworthy functional annotations [7-9]. In spite of new platforms enabling more efficient experimental protein characterisation, automated annotation methods still form the basis for functional assignment of new proteins [10]. These methods commonly rely on inferring a function from sequence similarity to curated sequences or to already existing entries in a given database. Annotations can be transferred either as a free text description of a function, or as more structured vocabularies like Gene Ontology [11] or Enzyme Commission classifications.
Sequence similarity-based annotation pipelines enable processing of vast amounts of newly sequenced data; however, it has been shown that, if not applied appropriately, they result in erroneous functional annotations, which later percolate throughout databases [12-18]. In order to improve functional annotations and predict novel protein subtypes, more refined methods are constantly being developed. They exploit signatures of protein families and domains, orthology, patterns of functional divergence, or a mixture of all the approaches [19-22]. Still, the quality of functional annotations is considered far from perfect [23]. Existing reports on the misannotation issue in public databases estimate the annotation error to be between 5% and 80%, depending on the protein family and database, and indicate overprediction as the main cause of such errors [24,25]. It is worth noting that these reports are based on entries from over 15 years ago, before a rapid increase in genome sequencing projects caused by the rise of low-cost sequencing technologies. The reliability of present annotations in public databases is largely unknown. In this study we utilize a high-throughput experimental platform, similar to those used for substrate profiling of protein families, to verify functional annotations to an enzyme class in the BRENDA database [26]. We provide an overview of all the sequences annotated as S-2-hydroxyacid oxidases (EC 1.1.3.15) and select 122 representatives of the class for experimental screening of their predicted function. We show that the majority of the sequences contain non-canonical protein domains, do not catalyse the predicted reaction, and are wrongly annotated to the enzyme class. Among the misannotated sequences we confirm four alternative enzymatic activities.
Finally, a computational analysis of all EC classes in BRENDA reveals that a large proportion of sequences are annotated to enzyme classes with no similarity to characterised enzymes, a problem which warrants further investigation.

Results

Exploration of EC 1.1.3.15 sequence space

Enzyme Commission (EC) classification is a numerical classification system for enzymes, based on the chemical reaction they catalyse and substrate they act upon. Different enzymes catalysing the same reaction receive the same EC number, regardless of their similarity in sequence or structure. A medium-size, easy to assay enzyme class 1.1.3.15 (S-2-hydroxyacid oxidase) was chosen for this proof-of-concept study. Representatives of the class oxidize the hydroxyl group of S-2-hydroxyacids like glycolate or lactate to 2-oxoacids, using oxygen as an electron acceptor (S1 Fig). All characterised enzymes of this class belong to a family of FMN-dependent α-hydroxy acid oxidases/dehydrogenases. Members of this protein family share high structural and functional similarities but differ in the ultimate electron acceptor: oxygen (S-2-hydroxyacid oxidase, EC 1.1.3.15; lactate monooxygenase, EC 1.13.12.4), cytochrome c (flavocytochrome b2, EC 1.1.2.3) or quinone (S-mandelate dehydrogenase, EC 1.1.99.31) [27-29]. A characteristic feature for S-2-hydroxyacid oxidases is their broad substrate scope in vitro, although the physiological substrate for plant and mammalian homologues is mainly glycolate or long chain hydroxyacids [30-32], while lactate is the main physiological substrate of bacterial homologues [33,34]. Members of EC 1.1.3.15 are of high biological importance, with plant GOX being crucial for photorespiration, mammalian HAOs taking part in glycine synthesis and fatty acid oxidation, and bacterial LOX metabolising L-lactate as an energy source [31]. Human HAO1 was recently proposed as a target for treating primary hyperoxaluria, an autosomal metabolic disorder leading to decline in renal function [35]. Bacterial LOX are of particular medical and industrial interest, being used for lactate biosensor development in clinical care, sport medicine, and food processing [36]. 
To obtain an overview of sequence diversity in EC 1.1.3.15, we downloaded all sequences annotated to this EC in BRENDA 2017.1 and obtained 1058 unique sequences after filtering out partial genes. The sequence interrelatedness of these diverse proteins was visualized in a multidimensional scaling (MDS) plot using computed UniRep embeddings [37]; a smaller distance in this plot indicates higher relatedness (Figs 1 and S2). Among the 1058 sequences 17 are characterised and/or manually curated enzymes: sequences listed in BRENDA [38] as experimentally tested or in SwissProt [1] as manually curated sequences having experimental evidence at protein level. Over 90% of the enzymes annotated to this enzyme class are of bacterial origin, nearly 6% of eukaryotic and 2.6% of archaeal (Fig 1A). Strikingly, 14 out of 17 characterised enzymes are of eukaryotic origin, showing a clear over-representation. The characterised sequences also cluster close together in the visualization (Fig 1A, 1B and 1C), indicating that the characterised/curated sequence diversity in EC 1.1.3.15 is limited.

Fig 1

Sequence space of sequences annotated to EC 1.1.3.15.

Sequences listed in BRENDA and SwissProt as experimentally tested are encircled. (A) Taxonomic origin of sequences. (B) Percentage of sequence identity to the closest experimentally tested or curated S-2-hydroxyacid oxidase. (C) Pfam domain architecture. (D) The mean alignment-based sequence identity between and within domain clusters. Pfam protein domains: FMN_dh (PF01070), FMN-dependent dehydrogenase; DAO (PF01266), FAD dependent oxidoreductase; Fer2_BFD (PF04324), BFD-like [2Fe-2S] binding domain; FAD_binding_4 (PF01565), FAD binding domain; FAD-oxidase_C (PF02913), FAD linked oxidases C-terminal domain; CCG (PF02754), cysteine-rich domain.

We next determined the similarity of each sequence in EC 1.1.3.15 to the closest characterised S-2-hydroxyacid oxidase in terms of alignment-based sequence identity and domain architecture. Most sequences have little similarity with the characterised ones; 79% of sequences annotated as 1.1.3.15 share less than 25% sequence identity with the closest characterised/curated sequence (Figs 1B and S3). Furthermore, only 22.5% of the 1058 sequences are predicted to contain the FMN-dependent dehydrogenase domain (FMN_dh, PF01070) which is canonical for known 2-hydroxy acid oxidases (Fig 1C). The majority of sequences were predicted to contain non-canonical domains, such as FAD binding domains characteristic for FAD-dependent oxidoreductases (PF01266, PF01565, PF02913), as well as a cysteine rich domain (PF02754) and 2Fe-2S binding domain (PF04324). Many of the sequences with non-canonical domains form distinct clusters (Fig 1C). An analysis of alignment-based similarity between these domain clusters showed that the average sequence identity to the canonical FMN-dependent dehydrogenase domain cluster is below 16% for all clusters. An all versus all comparison revealed that no two clusters share more than 21% average sequence identity, while the identity of sequences within clusters ranges between 33% and 55% (Fig 1D).
This analysis clearly shows that the enzyme class EC 1.1.3.15 contains a set of very diverse protein sequences, the majority of which have low identity to sequences with experimental evidence, and also lack protein domains characteristic of S-2-hydroxy acid oxidases.
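The comparison to the closest characterised sequence can be illustrated with a short Python sketch. It is not the pipeline used in this study: it assumes the sequences are already aligned to a common length (gaps written as `-`), it defines identity as matches over columns where both sequences carry a residue (one of several conventions; the one applied here is not stated in this section), and the sequences and function names are hypothetical.

```python
from typing import Dict, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Only compare columns where both sequences have a residue (no gap).
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def closest_characterised(query: str, characterised: Dict[str, str]) -> Tuple[str, float]:
    """Return (name, identity) of the characterised sequence most similar to the query."""
    return max(
        ((name, percent_identity(query, seq)) for name, seq in characterised.items()),
        key=lambda item: item[1],
    )


# Toy aligned sequences (hypothetical, for illustration only).
characterised = {"ref_A": "MKTAYIAK-Q", "ref_B": "MSTLYLGKEQ"}
queries = {"q1": "MKTAYIAKEQ", "q2": "GHWPNDRC-V"}

THRESHOLD = 25.0  # percent identity cut-off used in the text

# Collect the queries whose best hit falls below the cut-off.
low_identity = []
for name, seq in queries.items():
    ref, identity = closest_characterised(seq, characterised)
    if identity < THRESHOLD:
        low_identity.append(name)

fraction_low = len(low_identity) / len(queries)
```

Applied to real data, `fraction_low` would correspond to the share of sequences below 25% identity reported above, provided the same identity definition is used.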

Experimental characterisation of EC 1.1.3.15

Given the large diversity of sequences annotated to EC 1.1.3.15, we proceeded to experimentally validate their predicted activity. A total of 122 genes throughout the sequence space of the enzyme class were selected (S4 Fig, left panel), synthesised, cloned and recombinantly expressed in Escherichia coli in a high-throughput setup. Out of the 122 proteins, 65 were in soluble state (53%), with archaeal and eukaryotic proteins being proportionally less soluble than bacterial proteins (S4 Fig, right panel). Despite representing only half of the sequences chosen for experimental characterisation, the soluble proteins were still distributed throughout the sequence space of EC 1.1.3.15 (S4 Fig, left panel). The 65 soluble proteins were tested for S-2-hydroxy acid oxidase activity in an Amplex Red peroxide detection assay with a set of six 2-hydroxy acids: glycolate, lactate, 2-hydroxyoctanoate, 2-hydroxydecanoate, mandelate, and 2-hydroxyglutarate (S5 Fig).

Characterisation of proteins carrying the canonical FMN-dh domain

We first investigated 24 proteins representing a cluster of 230 sequences containing the FMN_dh domain; these have the highest sequence identity to previously characterised 2-hydroxy acid oxidases (Figs 1C and 2A). Among them 14 proteins were active with a broad substrate range, as is characteristic for enzymes in EC 1.1.3.15, while 10 proteins were inactive. Bacterial sequences in the cluster were predominantly active with lactate, medium chain and aromatic 2-hydroxy acids, whereas the two active eukaryotic enzymes showed the highest activity with glycolate and lactate.

Fig 2

Characterisation of protein cluster with high sequence identity to previously characterised S-2-hydroxyacid oxidases.

(A) Activity screen and protein characteristics. Dendrogram indicates protein relatedness. Superkingdoms: light purple, Bacteria; brown, Eukaryotes. Recorded activities are marked with squares; for proteins active with more than one substrate, the substrate preference is shaded with the highest activity for each enzyme scaled to 100%. Listed amino acids correspond to conserved residues in a glycolate oxidase from S. oleracea. The cartoons represent predicted domain and motif composition of the sequences, based on Pfam search. Domains lacking full Pfam alignment are represented with a sharp edge. FMN-binding domain (FMN_dh, PF01070) is marked in magenta, cytochrome b5-like heme binding domain (Cyt_B5, PF00173) is marked in green, and a prolonged stretch in loop4 is marked in blue. (B) Conserved amino acids of the active site of S-2-hydroxyacid oxidase mapped on a structure of glycolate oxidase from S. oleracea (PDB: 1GOX). Conserved residues are marked in blue, the FMN cofactor is marked in yellow, and the glycolate substrate in green. (C) Superimposed structures of the representatives of FMN-dependent 2-hydroxyacid oxidase/dehydrogenase family with their distinct motifs represented in a cartoon form: glycolate oxidase (magenta, PDB 1GOX), flavocytochrome b2 (green, PDB 1FCB), mandelate dehydrogenase (light blue, PDB 6BFG), lactate 2-monooxygenase (dark blue, PDB 6DVH).

We next analysed whether the 24 investigated proteins contain the seven conserved amino acid residues involved in catalysis and substrate binding [32], both using a multiple sequence alignment and protein structure analysis (Fig 2A and 2B). In 12 of the 14 active proteins all seven residues are conserved (Fig 2A), whereas 8 of the 10 inactive proteins lack at least one of the conserved residues. Presence of the seven conserved amino acids is thus a strong, but not absolute, indication of S-2-hydroxyacid oxidase activity.
As no investigation of folding of the purified proteins was performed, it is possible that the enzymes showing no activity, particularly the ones with all conserved residues present, were incorrectly folded. The seven active site residues are, however, conserved not only in S-2-hydroxyacid oxidases, but also among all the members of the FMN-dependent S-2-hydroxyacid oxidase/dehydrogenase family [28]. We therefore looked for sequence motifs indicating the presence of other family members in our selection (Fig 2C). Two of the screened proteins (B8MKR3 and B8MMC0 from Talaromyces stipitatus) contain a heme binding domain (PF00173) characteristic for flavocytochrome b2 L-lactate dehydrogenase (EC 1.1.2.3) [29] (Figs 2A and S6). These two proteins were tested in vitro for their ability to reduce cytochrome c, a physiological electron acceptor of flavocytochrome b2 L-lactate dehydrogenase. Indeed, the B8MKR3 protein displayed cytochrome b2 L-lactate dehydrogenase activity (S7 Fig). Additionally, four other proteins (E6SCX5 from Intrasporangium calvum, C9Y9E7 from a Curvibacter species and W6W585 from Rhizobium sp. CF080) contain a longer stretch in loop 4 characteristic for S-mandelate dehydrogenase (EC 1.1.99.31) and L-lactate 2-monooxygenase (EC 1.13.12.4) [27,28] (Figs 2A and S6). As seen in our Amplex Red assay, the four proteins display a high activity with mandelate, suggesting their native function may be as S-mandelate dehydrogenases, although further experiments are needed to determine this. Out of the 230 members of the FMN_dh cluster, which have high sequence identity to previously characterised EC 1.1.3.15 enzymes, a total of 6 proteins (2.6%) are predicted to contain a heme binding domain and 50 (22%) contain a longer stretch in loop 4, indicating that those sequences might be misannotated and would be better placed in other EC classes. However, a thorough biochemical and genetic characterisation of such enzymes is needed to test this hypothesis.

Characterisation of proteins carrying non-canonical domains

Next, we investigated the activity of 41 proteins not containing the canonical FMN-dh domain (Fig 1C), yet representing a full 78% of all sequences annotated to EC 1.1.3.15 in BRENDA. These proteins have only low sequence identity with previously characterised S-2-hydroxyacid oxidases (Fig 1B and 1D). Out of the 41 proteins, twelve come from the cluster predicted to contain a single FAD dependent oxidoreductase domain (DAO, PF01266). Six of the twelve solely oxidised the substrate L-2-hydroxyglutarate in the in vitro assay (Fig 3A). This narrow substrate scope is atypical for the previously known broad substrate-range EC 1.1.3.15 enzymes, which indicates an alternative native function of these proteins. Our findings are supported by those of a recent publication where activity of an E. coli homologue of the 6 DAO-containing proteins was described as L-2-hydroxyglutarate dehydrogenase (EC 1.1.99.2/ EC 1.1.5.13) [39]. As the Amplex Red activity assay used in our activity screen is designed to capture oxidase activity via hydrogen peroxide detection, we may have detected a low level of non-physiological oxidase activity of the 6 L-2-hydroxyglutarate dehydrogenases (see further discussion on the AR assay a few paragraphs below).

Fig 3

Characterisation of protein clusters with low sequence identity to previously characterised S-2-hydroxyacid oxidases.

Dendrogram indicating protein relatedness. Superkingdoms: light purple, Bacteria; dark purple, Archaea. Activities are marked with squares; for proteins active with more than one substrate, the substrate preference is shaded. The cartoons represent predicted domain and motif composition of the sequences, based on Pfam search. Domains lacking full Pfam alignment are represented with a sharp edge. Proteins with alternative activities chosen for kinetic characterisation are marked in bold. (A) Characterisation of protein clusters containing DAO domain. FAD dependent oxidoreductase domain (DAO, PF01266) is marked in blue, BFD-like [2Fe-2S] binding domain (Fer2_BFD, PF04324) is marked in purple. (B) Characterisation of remaining protein clusters. FAD binding domain (FAD_binding_4, PF01565) is marked in orange, FAD linked oxidases C-terminal domain (FAD-oxidase_C, PF02913) is marked in green, cysteine rich domain (CCG, PF02754) is marked in red. (C) Comparison of Pfam domains of sequences annotated to EC 1.1.3.15 in BRENDA version 2017.1 and 2019.2.

The remaining 29 sequences of the “non-canonical” clusters, containing either a BFD-like [2Fe-2S] binding domain (Fig 3A) or a FAD linked oxidases C-terminal domain, either alone or combined with a cysteine-rich domain (Fig 3B), were either inactive or did not display consistent substrate preferences (Fig 3A and 3B). We hypothesised that due to the non-canonical domain architecture and low sequence identity to characterised enzymes, these proteins may catalyse reactions different from the ones initially tested. By searching database information regarding the Pfam [40] domains and combining this information with orthology-based annotations and literature search, we found that some of these sequences are similar to dehydrogenases operating on four distinct substrates: glycerol-3-phosphate, glycolate, D-lactate and D-2-hydroxyglutarate.
In order to test whether the remaining 29 proteins catalyse these alternate reactions, we expressed and purified them, and the 22 successfully purified proteins were screened for the expected dehydrogenase activities with a set of common electron acceptors: nicotinamide adenine dinucleotide (NAD), nicotinamide adenine dinucleotide phosphate (NADP), the redox dye 2,6-Dichlorophenolindophenol (DCPIP), as well as the hydrogen peroxide probe Amplex Red (AR), and in selected cases cytochrome c (S8 Fig). When screened with DCPIP and AR, one protein was found to be active with glycerol-3-phosphate as a substrate (A0A0R3K2G2 from Caloramator mitchellensis), one with D-lactate (D4MUV9 from Anaerostipes hadrus) and one with D-2-hydroxyglutarate (A0A077SBA9 from Xanthomonas campestris). Additionally, three proteins (A0A0U5JSS4 from a Clostridium species, D4XIR1 from Achromobacter piechaudii, Q5WIP4 from Bacillus clausii) were active with each of the three substrates only in the AR screen (S8 Fig). None of the proteins were active with the electron acceptors NAD, NADP, or cytochrome c. The fact that some of the tested enzymes show activity with both AR and DCPIP is counter-intuitive as AR is a H2O2-dependent reporter, indicating that molecular oxygen is the electron acceptor, whereas DCPIP accepts electrons directly. Comparing standard curves of the two reporter molecules DCPIP and resorufin (the AR reaction product) revealed that the AR assay is several orders of magnitude more sensitive than DCPIP, on a molar basis (S9A Fig). We then carried out a direct comparison of enzyme activity in four purified enzymes using the DCPIP and AR assays. While the AR-dependent assay clearly gave the strongest signal, the enzymes displayed fifty to one hundred times higher catalytic rates in the DCPIP-based one (S9B Fig). Dehydrogenase activity is thus the prevalent one for the tested enzymes, although we were able to capture their trace oxidase activity. 
Overall, our screen of the non-canonical clusters revealed their erroneous annotation as EC 1.1.3.15, and we found four alternative activities among those sequences: L-2-hydroxyglutarate dehydrogenase, D-2-hydroxyglutarate dehydrogenase, D-lactate dehydrogenase, and glycerol-3-phosphate dehydrogenase. Four representatives with the alternative activities were chosen for further characterisation (Fig 3A and 3B, in bold); they were expressed, purified (S10A Fig), assayed at 25°C and their kinetic parameters calculated (Table 1 and S10B Fig). Three of the four enzymes (D4MUV9, A0A077SBA9, S2DJ52) showed good catalytic efficiency with substrate affinities in the low micromolar range and kcat/KM values above 1 x 10⁴ M⁻¹s⁻¹, strengthening the possibility that these may be the natural substrates. Additionally, based on reports of a homologous protein [41], the protein A0A077SBA9 was screened and showed modest side activity with D-malate. The fourth enzyme, A0A0R3K2G2, showed affinity for glycerol-3-phosphate in the low millimolar range, but with kcat/KM values approximately 100-fold lower than the other enzymes. Since this protein comes from the thermophilic bacterium Caloramator mitchellensis, whose optimal growth temperature is 55°C, we speculate that its catalytic efficiency would be higher at higher experimental temperatures.

Table 1

Kinetic parameters of selected proteins with functions distinct from S-2-hydroxyacid oxidase.

Values represent mean averages (+/- standard error of mean; n = 3).

Enzyme	Substrate	K_M [M]	k_cat [s^-1]	k_cat/K_M [M^-1s^-1]
D4MUV9	D-lactate	0.40 +/- 0.04 x 10⁻³	5.180	1.31 x 10⁴
A0A077SBA9	D-2-hydroxyglutarate	0.08 +/- 0.01 x 10⁻³	5.957	7.29 x 10⁴
A0A077SBA9	D-malate	5.03 +/- 1.38 x 10⁻³	0.039	7.78
S2DJ52	L-2-hydroxyglutarate	0.22 +/- 0.02 x 10⁻³	3.719	1.68 x 10⁴
A0A0R3K2G2	glycerol-3-phosphate	1.97 +/- 0.23 x 10⁻³	0.242	1.23 x 10²
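As a worked check of Table 1, the catalytic efficiency can be recomputed as k_cat divided by K_M. The sketch below uses only the rounded K_M and k_cat values printed in the table, so the results differ from the tabulated k_cat/K_M by a few percent (the tabulated values were presumably derived from unrounded fits). The variable and function names are illustrative.

```python
# (K_M in M, k_cat in s^-1), copied from Table 1 (rounded values).
table1 = {
    ("D4MUV9", "D-lactate"): (0.40e-3, 5.180),
    ("A0A077SBA9", "D-2-hydroxyglutarate"): (0.08e-3, 5.957),
    ("A0A077SBA9", "D-malate"): (5.03e-3, 0.039),
    ("S2DJ52", "L-2-hydroxyglutarate"): (0.22e-3, 3.719),
    ("A0A0R3K2G2", "glycerol-3-phosphate"): (1.97e-3, 0.242),
}


def catalytic_efficiency(k_m: float, k_cat: float) -> float:
    """Return k_cat / K_M in M^-1 s^-1."""
    return k_cat / k_m


efficiency = {key: catalytic_efficiency(*values) for key, values in table1.items()}

for (enzyme, substrate), value in efficiency.items():
    print(f"{enzyme:12s} {substrate:22s} {value:10.3g} M^-1 s^-1")

# Fold difference between each of the three efficient enzymes and the
# glycerol-3-phosphate dehydrogenase A0A0R3K2G2.
g3p = efficiency[("A0A0R3K2G2", "glycerol-3-phosphate")]
fold = {
    key: efficiency[key] / g3p
    for key in [
        ("D4MUV9", "D-lactate"),
        ("A0A077SBA9", "D-2-hydroxyglutarate"),
        ("S2DJ52", "L-2-hydroxyglutarate"),
    ]
}
```

With these rounded inputs, each of the three efficient enzymes is at least about 100-fold more efficient than A0A0R3K2G2 with glycerol-3-phosphate.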

Taken together, our results indicate that proteins which do not contain the canonical FMN-dh domain, which represent 78% of all proteins annotated to EC 1.1.3.15 in BRENDA, likely have in vitro catalytic activities that do not match their current EC classification. It is difficult to assess with certainty why these sequences were annotated to EC 1.1.3.15 in the first place, and we can only speculate on the origins of the misannotations. L-2-hydroxyglutarate dehydrogenase upon its discovery was incorrectly characterised as an oxidase [42], and thus received an incorrect assignment to EC 1.1.3.15. It is possible that all the similar proteins containing DAO domain, including glycerol-3-phosphate dehydrogenase-like proteins, followed the incorrect annotation. The misannotation of D-lactate dehydrogenase, D-2-hydroxyglutarate dehydrogenase and as a result other proteins containing FAD_binding_4 and FAD-oxidase_C might stem from the fact that the E. coli homolog, encoded by genes in the glcDEFGB operon, was initially believed to be a glycolate oxidase belonging to the enzyme class EC 1.1.3.1 [43-45], which was later merged with EC 1.1.3.15.

Analysing annotation error in the BRENDA database

Biological databases are dynamic by nature and receive regular updates with new experimental information as well as additional proteins from sequenced genomes. We therefore investigated how the annotations to EC 1.1.3.15 changed over time. In our analysis we compared Pfam domains of sequences annotated to the class in BRENDA 2017.1 and BRENDA 2019.2 (Fig 3C). Over the course of 2.5 years, representing five database versions, the enzyme class grew markedly from 1058 sequences to 1659 (excluding redundant and partial sequences). However, the number of sequences containing the canonical FMN-dh domain actually decreased by 11, whereas the newly added sequences are part of clusters containing “non-canonical” protein domains. The most striking rises in this time period appeared in the cluster shown by us to contain proteins displaying glycerol-3-phosphate dehydrogenase activity in vitro (Pfam domains DAO and Fer2_BFD), which grew from 24 to 220 sequences, and in the cluster containing the L-2-hydroxyglutarate dehydrogenases (Pfam domain DAO), which rose from 379 to 650 sequences. This comparison clearly shows that, in the EC 1.1.3.15 enzyme class, the misannotations from old database versions were perpetuated to newly added homologous sequences. Based on the number of sequences lacking the canonical domain architecture alone (absence of the canonical FMN dehydrogenase domain) we estimate that in 2017 at least 78% of sequences in EC 1.1.3.15 were unlikely to catalyse the predicted reaction, while in 2019 this number had grown to 87%.
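The two percentages can be reproduced with simple arithmetic. The sketch below takes the 1058 unique sequences reported earlier for BRENDA 2017.1 and the 1659 sequences in BRENDA 2019.2, and assumes that the number of sequences with the canonical domain in 2017 equals the 230-member FMN_dh cluster described above. That last figure is an assumption made for illustration; the counts underlying the published estimate are not given in this section.

```python
def non_canonical_percent(total: int, canonical: int) -> float:
    """Percentage of sequences lacking the canonical FMN_dh domain."""
    return 100.0 * (total - canonical) / total


total_2017 = 1058                     # unique sequences, BRENDA 2017.1 (from the text)
total_2019 = 1659                     # sequences, BRENDA 2019.2 (from the text)
canonical_2017 = 230                  # assumption: size of the FMN_dh cluster
canonical_2019 = canonical_2017 - 11  # the text reports a decrease of 11

pct_2017 = non_canonical_percent(total_2017, canonical_2017)
pct_2019 = non_canonical_percent(total_2019, canonical_2019)

print(f"2017: {pct_2017:.1f}% non-canonical")
print(f"2019: {pct_2019:.1f}% non-canonical")
```

Under these assumptions the values round to 78% and 87%, matching the estimates quoted above.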

Investigation of alternative functional predictions for sequences annotated to EC 1.1.3.15 in the BRENDA database

In the BRENDA database the enzymatic function information is extracted manually from scientific literature, but the predicted annotations are imported from external protein databases [26]. In order to investigate if other annotation methods provide better functional predictions, we scanned the sequences annotated to EC 1.1.3.15 with HAMAP and EggNOG predictors (S1 File). HAMAP [46] classifies and annotates proteins using a collection of expert-curated protein family signatures and annotation rules, while EggNOG [22] is a tool based on fast orthology assignments using precomputed clusters and phylogenies. Both methods provided predictions for only a portion of the input sequences (74% HAMAP, 59% EggNOG), indicating that for some of the sequences there was no evidence for either EC 1.1.3.15, or any other functional prediction. The HAMAP scan provided no annotation that could be directly linked to the S-2-hydroxyacid oxidase activity. Instead, 241 sequences were linked to a function of L-lactate dehydrogenase (MF_01559), and 685 to a function of L-hydroxyglutarate dehydrogenase (MF_00990), which included sequences shown experimentally by us to be active with L-hydroxyglutarate, but also glycerol-3-phosphate (Figs 3 and S8). The EggNOG method assigned 292 sequences with S-2-hydroxyacid oxidase activity (EC 1.1.3.15), two sequences with L-hydroxyglutarate dehydrogenase activity (EC 1.1.99.2) and 79 with L-lactate dehydrogenase activity (EC 1.1.2.3), and captured the glycerol-3-phosphate dehydrogenase activity (EC 1.1.5.3) for 54 sequences and D-lactate dehydrogenase activity (EC 1.1.2.4) for 69 sequences. Neither of the methods predicted the experimentally confirmed activities of D4MUV9 (D-lactate dehydrogenase) and A0A077SBA9 (D-2-hydroxyglutarate dehydrogenase). These data show that the use of orthogonal methods of functional annotation can further aid in providing more accurate, if not perfect, functional predictions.

Exploration of functional annotations in other enzyme classes

In our initial analysis of EC 1.1.3.15 we observed that enzymes from eukaryotes had been disproportionately studied and that a large proportion of sequences annotated to the class shared little similarity with them (Fig 1). We next asked whether EC 1.1.3.15 is a special case, or whether these observations constitute a trend across all of BRENDA. To answer this question we first downloaded all protein sequences from BRENDA 2019.2 and determined which of these have experimental evidence in either BRENDA or SwissProt. We found 30 574 unique identifiers with experimental evidence in SwissProt and 31 287 in BRENDA, only 11 498 of which were overlapping between the two sources. Next, we determined, for each EC class in BRENDA, the degree of identity between each experimentally uncharacterised sequence and the most similar characterised/curated one. To decrease the effect of a large number of similar sequences from repeated sequencing of model organisms we clustered the sequences at 90% identity using CD-HIT [47] and carried out the subsequent analysis using the ~5.3 million cluster representatives only. As in EC 1.1.3.15 (Fig 1), this global analysis shows that the overwhelming majority of sequences in BRENDA are bacterial (Fig 4A), whereas the majority of experimentally characterised/curated enzymes are eukaryotic (Fig 4B). Furthermore, most enzyme classes have only a small number of characterised/curated enzymes (Fig 4C), indicating that the sequence diversity explored within each EC class is limited.

Fig 4

Exploration of functional annotation throughout all BRENDA enzyme classes.

(A) The total number of representative protein sequences (after clustering at 90% identity) annotated to EC classes in BRENDA, which is approximately 5.3 million. (B) The total number of experimentally characterised/curated enzymes. (C) Histogram showing the number of characterised/curated enzymes per EC class (bin size of 1). Histograms showing the distribution of sequence identities between all 5.3 million cluster representatives and their closest characterised/curated enzyme for Archaea (D), Bacteria (E), and Eukaryota (F) (with a bin size of 1). Proteins which do not have the same Pfam domains as characterised/curated enzymes are coloured in grey.

To analyse the similarity of experimentally uncharacterised sequences to characterised/curated ones we computed, for each EC class, the sequence identity of each cluster representative to the closest characterised enzyme. This analysis is analogous to the one carried out for EC 1.1.3.15 (Fig 1B). The results for all EC classes were aggregated and are presented in Fig 4D, 4E and 4F. In all three superkingdoms the identities roughly follow a normal distribution with a mean below 50% identity. Peaks at 0% represent enzymes for which no characterised homolog is known, and peaks at 100% represent enzymes that have themselves been characterised. We also note peaks around 18% identity; these represent the average pairwise identity of two randomly selected sequences within an EC class (S12 Fig). Strikingly, in each of the superkingdoms almost one fifth of sequences share less than 25% pairwise sequence identity with the closest characterised/curated enzyme within their own EC class. Such sequences are likely to be incorrectly annotated to a given EC, considering that this is well below the level where function can be confidently transferred between homologous proteins [48,49].
Furthermore, 18% of all sequences, mainly the low-identity ones, are not predicted to have the same Pfam domains as the experimentally characterised enzymes (Fig 4D, 4E and 4F, grey bars), providing further evidence of their likely misannotation. Many such low-homology sequences are annotated even to ostensibly well-characterised enzyme classes with industrially relevant activities (Table 2).

Table 2

Overview of annotation to enzyme classes of industrial interest.

EC	Name	%id < 25%*	Number of characterised proteins**	Applications [50]
3.1.1.3	lipase	54.7	141	detergent, leather processing, pharmaceutical synthesis, degradation of crude oils and plastics
3.1.1.1	carboxylesterase	47.6	106	degradation of plastics
3.2.1.4	cellulase	30.6	191	pulp and paper processing, detergent
3.2.1.8	xylanase	29.9	210	animal feed processing, pulp and paper processing
3.2.1.1	alpha amylase	23.9	87	flour adjustment, detergent, leather processing
3.1.1.74	cutinase	10.2	28	detergent, degradation of plastics

* Percentage of sequences in the EC with less than 25% identity to the closest characterised enzyme of the EC

** Proteins listed as characterised in BRENDA database and/or with “experimental evidence at protein level” label in SwissProt

Discussion

In this study we present an experimental investigation of sequence space to explore misannotation in a single enzyme class. By assessing the in vitro catalytic activity of 122 sequences representative of EC 1.1.3.15 in a high-throughput screening experiment we uncovered enzymes which do not display the predicted activity (Figs 2 and 3). Indeed, among the tested enzymes we confirm four alternative catalytic activities which are not compatible with their current annotation. Using sequence homology and protein domain predictions we infer that at least 78% of sequences in the enzyme class are possibly misannotated. In contrast to previous studies investigating annotation errors [24,25], our setup allowed us not only to estimate the error, but also to examine alternative functions of the misannotated sequences. Our experimental approach to the misannotation problem comes with the drawback of limited scope, as we describe in detail only one enzyme class, whereas bioinformatic approaches allow for much broader analysis. However, we argue that our setup is ideal for understudied enzyme classes, and protein families for which experimental evidence is scarce. The most comprehensive misannotation study so far provided a bioinformatic overview of annotation error in 37 enzyme families in database entries from 2005 [25]. All the analysed families were well-studied and no additional experimental evidence was required to conduct it. Schnoes and coworkers divided the types of misannotation into four categories: “no superfamily association”, “missing functionally important residues”, “superfamily association only”, “below trusted HMM cutoff”, and showed that the last category is the most prevalent cause of annotation error. This type of error, often called over-annotation, is particularly common in large, multigene families, where enzymes perform similar chemistries on different substrates [51].
In our analysis of EC 1.1.3.15 we also found examples of proteins annotated to the class without functional residues, as well as other members of the superfamily; however, the lack of superfamily association was the main cause of misannotation. In the work by Schnoes et al., which was based on entries to public databases in 2005, only 3% of all sequences were considered misannotated due to the lack of sequence similarity to the gold standard of a superfamily. Our study shows that, 15 years later, this number is likely much higher. Similarly to the findings described by Schnoes and coworkers [25], we also found tangible evidence that enzyme misannotation accumulates, rather than being corrected, over time (Fig 3C). Although we did not explore all possible causes of misannotation for all enzyme classes, we show that 17.8% of all sequences annotated in BRENDA share less than 25% sequence identity to the nearest characterised/curated enzyme of the class, and thus are unlikely to perform the predicted function (Fig 4D, 4E and 4F). Similarly, 18.1% of all sequences do not have the same Pfam domains as characterised/curated enzymes from their enzyme class. This is another strong indicator for misannotation, although a portion of this percentage might be explained by missing domains in partially sequenced genes. It is also possible that some of those sequences indeed perform the predicted activity, but the records of their experimental characterisation were not registered in BRENDA or SwissProt databases. In our work we chose to investigate functional annotations to the BRENDA database [38] as it is the premier database linking protein entries with biochemical data, and due to its status as an ELIXIR core data resource (https://elixir-europe.org/platforms/data/core-data-resources).
In BRENDA, detailed enzymatic function information is extracted manually from scientific literature, but the predicted annotations are directly imported from UniProt and two of its databases: TrEMBL and Swiss-Prot. Whereas Swiss-Prot annotations are manually curated, and generally highly reliable [25], the TrEMBL section provides automatic, unreviewed annotations, accepting annotations provided during genome submissions, only some of which are corrected by an internal prediction system. Taking into consideration this close link between the BRENDA database and UniProt, it is likely that the levels of misannotation to enzyme class shown in our study for the former database are very similar in the latter. It is worth noting, however, that UniProt itself contains a broader description of enzyme function, listing not only an EC number, but also links to other resources predicting protein families, domains, and molecular functions. Resources like InterPro [20], together with its associated databases, attempt to provide more accurate methods for functional annotation, using algorithms relying on protein family signatures or gene ontologies. We show that scanning BRENDA 1.1.3.15 entries with alternative annotation predictors, HAMAP and EggNOG, provides largely different annotation results. These alternative annotations were in better, although not perfect, agreement with our experimental data than the ones proposed in BRENDA. This highlights the fact that the methods, and as a consequence the reliability of functional annotations, vary widely between databases. With the ever-growing numbers of genomes being sequenced, the gap between experimentally characterised and automatically annotated genes will continue to grow. It is therefore vital that a complete coverage of functional data is available for automated annotation [52].
In our study we characterised four proteins annotated to EC 1.1.3.15 with alternative activities, and in all cases after a literature search we found articles describing homologous proteins with the same activities [39,41,53,54]. Only one article proposed an annotation transfer [39] which resulted in a recent re-annotation of the protein in UniProt (P37339 protein from E. coli, L-2-hydroxyglutarate dehydrogenase, EC 1.1.5.13). The remaining proteins are still not recorded in protein databases as being experimentally tested, and thus do not serve as a reliable base for function transfer. Secondary protein databases, such as UniProt or BRENDA, welcome users’ corrections, however, it is uncertain to what extent those options are actively used by the community and result in correction of annotations. Initiatives such as COMBREX DB, a database of experimentally validated gene annotations [9], or STRENDA, a guideline of standards for reporting enzymology data [55,56] could help to solve the problem, but only if the whole scientific community adopts these standards. As a response to this issue, the journal Biochemistry recently called on authors to include accession IDs for all proteins experimentally characterised in their manuscripts [52], a requirement that should certainly be adopted by other journals. We believe that a structured way of registering proteins characterised in high-throughput experiments should also be developed, and though the depth of protein characterisation in such approaches is limited, they can provide an excellent overview of the substrate scope of a large number of proteins. Incorrect gene annotations that accumulate over time might have serious consequences for exploration of novelty and understanding fundamentals of biological functions [23]. As shown by us, a number of enzymes with important biological functions were misannotated to the EC 1.1.3.15, including ones taking part in amino acid [39,41], glycerol [53], or lactate [54] metabolism. 
Even more proteins with functions yet to be discovered might be hidden among the misannotated sequences. The fields of systems biology [57], metabolic and enzyme engineering [58,59] also rely on accurate annotations, and improved methods for functional annotation are constantly being developed to meet their needs [20,22,60].

Methods

EC 1.1.3.15 sequence space analysis

All protein sequences from BRENDA (https://www.brenda-enzymes.org/, version 2017.1) were downloaded and their full UniRep embeddings [37], of 5700 values, were computed. Identical sequences were de-duplicated and multidimensional scaling (MDS) was carried out on the remaining representations using the builtin function in Scikit-learn [61] to decrease the dimensionality of this representation to two, thus allowing visualization as a scatterplot (Fig 1). Taxonomic information for each sequence was obtained by searching for the source organism’s name in NCBI Taxonomy resource (https://www.ncbi.nlm.nih.gov/taxonomy). Sequences considered as “characterised” were obtained from UniProtKB/Swiss-Prot (https://www.uniprot.org/) as well as from BRENDA. Specifically, all protein identifiers from UniProtKB/Swiss-Prot (version 2020_02) annotated as belonging to EC 1.1.3.15 and labelled with “Evidence at protein level” were used, as well as those occurring in the “Organism” table of the EC 1.1.3.15 html page in BRENDA (version 2019.1). Pairwise sequence alignments were carried out, using MUSCLE [62], between all 1.1.3.15 sequences. For each sequence the maximum identity to a characterised/curated one was retained (Fig 1B). Pfam protein domain information for each sequence was obtained from UniProtKB. For the domain architectures specified in Fig 1D the arithmetic mean of all pairwise identities was calculated, within each architecture, as well as between architectures.
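The step of retaining, for each sequence, the maximum identity to a characterised/curated one can be illustrated with a minimal Python sketch. It assumes the pairwise alignments have already been produced (for example by MUSCLE) and are available as gapped strings. The function names, the input layout, and the choice of denominator (columns where neither sequence has a gap) are assumptions made for illustration; the text does not specify how identity was normalised.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two gapped, equal-length aligned sequences.

    Assumption: only columns without a gap in either sequence are compared.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def max_identity_to_characterised(alignments):
    """Keep, per query, the highest identity to any characterised sequence.

    `alignments` maps (query_id, characterised_id) to a tuple of the two
    aligned strings. This layout is hypothetical.
    """
    best = {}
    for (query_id, _ref_id), (aln_query, aln_ref) in alignments.items():
        identity = percent_identity(aln_query, aln_ref)
        if identity > best.get(query_id, -1.0):
            best[query_id] = identity
    return best
```

A sequence that has itself been characterised would score 100% against itself under this scheme, which is consistent with the peak at 100% described for the identity histograms.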

Sequence selection for experimental testing

Protein sequences from all EC classes designated as being oxidoreductases acting on hydroxyl groups with oxygen as an electron acceptor (EC 1.1.3.-) were downloaded from BRENDA (version 2017.1) and processed as outlined below, but only sequences from 1.1.3.15 were tested here, the others being reserved for future work. To improve the quality of subsequent alignments, sequences shorter than 200 amino acids (61 total for EC 1.1.3.15) or longer than 580 (31 total for EC 1.1.3.15) were removed, as well as sequences with “X” in them (7 total for EC 1.1.3.15). An all versus all BLAST was carried out using blastp from BLAST+ [63] with standard settings, followed by clustering using the MCL algorithm [64] with standard settings, except for the inflation parameter -I, which was set to 1.4. This resulted in 17 clusters. A multiple-sequence alignment was created for each cluster using MUSCLE [62]. The Shannon entropy, a metric quantifying the degree of conservation at each position, was used to select a diverse and informative set of sequences for testing. The metric was calculated for each multiple sequence alignment and sequences were then iteratively selected such that each newly chosen one maximally increased the information gain; they were chosen to maximize the mutual information explained within each alignment. This iterative sequence selection was continued until 85% of the information in each cluster had been explained.
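The length and ambiguity filters and the per-position Shannon entropy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' script: the function names are hypothetical, the gap character is treated as an ordinary symbol, and the iterative selection up to 85% of explained information is not reproduced because the text does not give enough detail to do so faithfully.

```python
import math
from collections import Counter


def passes_filters(sequence, min_len=200, max_len=580):
    """Keep sequences of 200 to 580 residues that contain no 'X'."""
    return min_len <= len(sequence) <= max_len and "X" not in sequence.upper()


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.

    Assumption: gaps ('-') count as a symbol like any residue.
    """
    counts = Counter(column)
    total = len(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(aligned_sequences):
    """Entropy of every column of a multiple sequence alignment."""
    return [column_entropy(col) for col in zip(*aligned_sequences)]
```

A fully conserved column has an entropy of 0 bits, while a column split evenly between two residues has an entropy of 1 bit, so higher values mark more variable positions.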

Cloning, expression of sequences and protein purification

Generated sequences were synthesised, cloned into the pET21a vector and sequence-verified by Twist Bioscience. Between the gene sequence and vector backbone, a C-terminal linker was added (AAALEHHHH), which in combination with six histidines from an expression vector resulted in a deca-His-tag for improved protein purification. The 122 constructs used in this work were deposited to Addgene (https://www.addgene.org/) with IDs 163180–163301. High throughput expression, lysis and, when necessary, purification were carried out according to the published protocol [65]. Briefly, expression was carried out in E. coli BL21(DE3) cells, in 96-well deep well plates, in 1 ml autoinduction TB (Foremedium). After cell lysis, cells were spun down and supernatants analysed by SDS-PAGE followed by Coomassie staining (InstantBlue, Expedeon). Each sequence was expressed three times; a sequence was scored as soluble when the corresponding band was present on a gel in at least two expressions. The soluble fraction of the lysate was used for the screen of S-2-hydroxyacid oxidase activity, whereas affinity-purified proteins were used for the dehydrogenase activity screen and determination of kinetic parameters.

Activity assays

To screen for S-2-hydroxyacid oxidase activity, lysates of soluble proteins were assayed in the Amplex Red hydrogen peroxide detection assay (Fisher Scientific) with a selection of 2-hydroxyacids: glycolate, L-lactate, DL-2-hydroxyoctanoate, DL-2-hydroxyoctadecanoate, DL-mandelate, L-2-hydroxyglutarate. Each protein was assayed three times and was considered a hit if it was scored as soluble and active at least twice. 1 μl of soluble fraction of the lysate after protein expression was added to a reaction mixture containing 20 mM HEPES pH 7.4, 50 μM Amplex Red (Fisher Scientific), 0.1 U/ml horseradish peroxidase (HRP) and 1 mM of an appropriate substrate. Final reaction volume was 20 μl, and the assay was performed in black 384-well low volume plates (Greiner). After 30 minutes of incubation in the dark, the endpoint measurements were performed with an excitation filter of 544 nm and emission filter of 590 nm in a BMG Labtech FLUOstar Omega microplate reader. Each reaction was performed in triplicate. Values for non-specific activity in the absence of substrate were subtracted from experimental measurements. E. coli lysate from cells expressing BSA protein was used as a control to establish a limit of detection of the assay (mean_BSA + 4 × SD_BSA). For the dehydrogenase activity screening and kinetic characterisation, proteins were purified by affinity purification, and assayed with a range of substrates and electron acceptors. Purified protein in the volume of 1 μl was added to a reaction mixture containing 20 mM HEPES pH 7.4, 2 mM of substrate and electron acceptor. L-lactate (cytochrome) dehydrogenase activity was tested with 0.1 mM cytochrome c as electron acceptor. Glycerol-3-phosphate dehydrogenase activity was tested with the following electron acceptors: 0.2 mM DCPIP + 3 mM PMS, 50 μM Amplex Red + 0.1 U/ml HRP, 1 mM NAD, 1 mM NADP. 2-hydroxyacid dehydrogenase activity was tested with all the above electron acceptors, with the addition of 0.15 mM cytochrome c.
Activity was measured in triplicate every 30 seconds over 15 minutes at 340 nm in the case of NAD and NADP, at 600 nm in the case of DCPIP/PMS, at 550 nm in the case of cytochrome c, and with excitation/emission filter of 544 nm/590 nm in the case of Amplex Red/HRP. Unspecific reduction of electron acceptor was monitored in controls lacking substrate, and the values were subtracted from experimental measurements. The kinetic values for four chosen proteins were determined at 25°C with DCPIP + PMS as electron acceptor and a varied range of substrate concentrations. Protein concentrations used for the assays were: 60 nM D4MUV9, 50 nM A0A077SBA9 with D-2-hydroxyglutarate, 1.3 μM A0A077SBA9 with D-malate, 25 nM S2DJ52, 660 nM A0A0R3K2G2. Activities were calculated using the extinction coefficient of DCPIP at 600 nm (20.7 mM⁻¹ cm⁻¹). Comparison of DCPIP and AR reaction rates was carried out for the four characterised proteins. Reaction rates were measured for both electron acceptors, using concentration values of proteins and substrates as listed above.
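Two of the calculations in this section lend themselves to a short worked example in Python: the limit of detection (mean of the BSA control plus four standard deviations) and the conversion of an absorbance slope at 600 nm into a DCPIP reduction rate using the stated extinction coefficient. The sketch below is illustrative only. The function names are hypothetical, the use of the sample standard deviation is an assumption, and the optical path length is left as a parameter because the text does not report it.

```python
from statistics import mean, stdev

# Extinction coefficient of DCPIP at 600 nm, as stated in the text (mM^-1 cm^-1).
DCPIP_EXTINCTION_PER_MM_CM = 20.7


def limit_of_detection(control_signals, n_sd=4.0):
    """Mean of the control signals plus n_sd standard deviations.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return mean(control_signals) + n_sd * stdev(control_signals)


def dcpip_rate_um_per_min(delta_a600_per_min, path_length_cm=1.0):
    """Convert an absorbance slope into a DCPIP reduction rate (uM/min).

    Uses the Beer-Lambert law: rate = |dA/dt| / (epsilon * path length).
    The default path length of 1 cm is a placeholder, not a value from the text.
    """
    rate_mm_per_min = abs(delta_a600_per_min) / (
        DCPIP_EXTINCTION_PER_MM_CM * path_length_cm
    )
    return rate_mm_per_min * 1000.0  # mM -> uM
```

For instance, an absorbance decrease of 0.207 per minute over a 1 cm path corresponds to roughly 10 μM DCPIP reduced per minute. Dividing such a rate by the enzyme concentration in the well would give an apparent turnover number.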

EC 1.1.3.15 annotation over time

All EC 1.1.3.15 sequences were downloaded from two BRENDA versions, differing by 2.5 years in their publication (versions 2017.1 and 2019.2). Identical sequences in each database version were de-duplicated, resulting in 1058 sequences from 2017.1 and 1659 sequences from 2019.2. Pfam domains for these sequences were obtained by querying UniProt using the protein identifiers, and mining the resulting page for domain data. The frequency of each domain was subsequently computed.
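The domain frequency computation for a single database version can be sketched as follows. This is a minimal Python illustration that assumes the Pfam accessions have already been retrieved for each protein. The function name and input layout are hypothetical, and "frequency" is interpreted here as the fraction of proteins carrying a domain at least once, which is an assumption.

```python
from collections import Counter


def domain_frequencies(domains_by_protein):
    """Fraction of proteins that carry each Pfam domain at least once.

    `domains_by_protein` maps a protein identifier to a list of Pfam
    accessions (hypothetical layout).
    """
    n_proteins = len(domains_by_protein)
    if n_proteins == 0:
        return {}
    counts = Counter()
    for domains in domains_by_protein.values():
        counts.update(set(domains))  # count each domain once per protein
    return {domain: n / n_proteins for domain, n in counts.items()}
```

Running the same function on the de-duplicated sequence sets from both BRENDA versions and comparing the resulting dictionaries would show which domains gained or lost share between releases.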

Exploration of alternative annotations

Sequences listed in the file “1_1_3_15_BRENDA_sequences_filtered_2017_1.fasta” were uploaded for the scans by HAMAP (https://hamap.expasy.org/hamap_scan.html) and eggNOG-mapper v2 (http://eggnog-mapper.embl.de/). Xlsx results files from the scans were downloaded.

Exploration of annotation quality throughout enzyme classes

A list of UniProt identifiers for enzymes considered “characterised” was compiled from SwissProt and BRENDA as described in the first Methods section. Protein sequences from all EC classes were downloaded from BRENDA (version 2019.2). Within each EC class, sequences were clustered to 90% identity using CD-HIT [47] with standard settings and a word size of 5. Cluster representatives were retained for subsequent analysis. Since the clustering had resulted in some “characterised” sequences to be removed (they were not cluster representatives) these were added back. For every cluster representative within each EC class the sequence identity to the closest characterised/curated sequence (within that class) was computed. First, an alignment-free measure of similarity was obtained using the alfpy package [66] by computing count-based k-tuples with word size of 3 and Normalised Google Similarity [67] as a distance measure (S11 Fig). For each uncharacterised-characterised pair with highest k-tuple-based similarity, pairwise sequence alignments were created using MUSCLE and the sequence identities calculated. These are the identities reported. The superkingdom of the source organism was obtained for each organism, firstly by matching the organism name with the NCBI-Taxonomy database, and secondly by querying UniProt using the protein identifiers. Pfam (release 33.1) domain information was obtained from the “Pfam-A.full.uniprot” file provided at the FTP site (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/). Two proteins were scored as having the same Pfam domains only in cases where all domains matched, but disregarding their order.
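Two building blocks of this procedure can be sketched in Python: the order-independent comparison of Pfam domains and the counting of k-tuples (words of size 3) that underlies the alignment-free similarity. The function names are hypothetical. Comparing domains as sets, and thereby ignoring how many times a domain repeats, is an assumption, and the Normalised Google distance computed by alfpy is not reproduced here.

```python
from collections import Counter


def same_pfam_domains(domains_a, domains_b):
    """True when two proteins carry the same Pfam domains, in any order.

    Assumption: repeat counts are ignored, so sets are compared.
    """
    return set(domains_a) == set(domains_b)


def count_ktuples(sequence, k=3):
    """Count overlapping words of length k in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

Under this definition a protein with an extra domain relative to the characterised enzyme does not count as a match, which follows the requirement in the text that all domains match.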

Software

Scripts used to analyse data and generate manuscript figures are available as a GitHub (https://github.com) repository: https://github.com/EngqvistLab/analyze_1.1.3.15. All software packages, with their versions, are specified in a Miniconda (https://docs.conda.io) environment file in that repository. Briefly, analysis was carried out using the Python programming language version 3.7 (http://www.python.org), using the following packages: Biopython version 1.76 [68], Pandas version 1.0.1, Numpy version 1.18.1 [69], Matplotlib version 3.1.3 [70], Scikit-learn version 0.20.0 [61], TensorFlow version 1.15.0 (https://www.tensorflow.org/), Networkx version 2.5 (https://networkx.org/), Jupyter version 1.0.0 (https://jupyter.org/), Alfpy version 1.0.6 [66], BeautifulSoup4 version 4.9.3 (https://www.crummy.com/software/BeautifulSoup/). Additionally, the following standalone software was used: MUSCLE version 3.8.1551 [62], CD-HIT version 4.8.1 [47], MCL version 14.137 [64], BLAST+ version 2.5.0 [63], and UniRep [37].

Schematic representation of the reaction catalysed by S-2-hydroxyacid oxidases (EC 1.1.3.15).

(TIFF)

A hexbin plot indicating Euclidean pairwise distances between the 1058 proteins annotated to EC 1.1.3.15.

Clustering along the diagonal indicates that the multidimensional scaling (MDS) dimensionality reduction faithfully represents pairwise distances of the UniRep representations of these sequences. The total number of pairwise distances is indicated, corresponding to half of the distance matrix, without the diagonal. (TIFF)

Identity of sequences annotated as EC 1.1.3.15 to the closest characterised S-2-hydroxyacid oxidase.

(TIFF)

Distribution of the insoluble, active and inactive screened proteins throughout the sequence space (left panel) and superkingdoms (right panel).

(TIFF)

S-2-hydroxyacid substrates used for the screening of EC 1.1.3.15 sequence space.

The donor group is marked in red. (TIFF)

Multiple sequence alignment of previously characterised representatives of the FMN-dependent 2-hydroxyacid oxidase/dehydrogenase family and proteins characterised in the study.

Conserved residues around the active site are circled in red. Sequence of predicted heme binding domain is highlighted in green, the elongated loop 4 is highlighted in blue. MSA performed in PROMALS3D (1) and visualised with Multiple Align Show (https://bioinformatics.org/sms/). (TIFF)

Cytochrome c reduction assay of putative flavocytochrome b2 proteins.

Increase of signal at the wavelength of 550 nm indicates reduction of cytochrome c and protein activity. (TIFF)

Exploration of alternative activities of selected proteins.

Presence of activity is marked with a dark purple square. (A) glycerol-3-phosphate dehydrogenase activity screen (B) 2-hydroxyglutarate dehydrogenase activity screen. (TIFF)

Comparison of sensitivity of Amplex Red and 2,6-dichlorophenolindophenol (DCPIP)-based assays.

(A) Standard curves of resorufin, a product of Amplex Red-based assay (upper panel) and DCPIP (lower panel). Indicated by asterisk are concentrations of detection limit, as calculated by Anova single factor test (0.76 nM resorufin, 1.56 μM DCPIP). (B) Reaction rates of selected enzymes with the two electron acceptors, normalised to the reaction rate with DCPIP. Error bars in all figures represent standard deviation of the data obtained with three replicates. (TIFF)

Characterisation of proteins with activities alternative to 1.1.3.15.

(A) SDS-PAGE gel of purified proteins chosen for kinetic characterisation. (B) Kinetic curves of the characterised enzymes. Error bars show standard error of three replicates. (TIFF)

Test to find best k-tuple algorithm settings.

Using 400 randomly selected protein sequences, all pairwise distances were calculated using different word sizes and distance measures. These distances were compared to distances computed using pairwise alignments. Appropriate k-tuple settings will cause points to lie on a diagonal, thus showing a high degree of correlation with the alignment-based values. Spearman’s rho and p-value are indicated for each plot. (TIFF)

Average similarity between 400 randomly selected sequences from EC 1.1.3.15 (left panel), using k-tuple scores and pairwise alignments (right panel).

The k-tuple score was computed using a word size of 3 and google as a distance measure. The mean alignment-based identity is 18%. The total number of pairwise similarities is indicated, corresponding to half of the identity matrix, without the diagonal. (TIFF)

Results of a functional prediction scan of sequences annotated to EC 1.1.3.15 in BRENDA 2017.1 using HAMAP and EggNOG servers.

(XLSX)

4 Jun 2021

Dear Dr. Engqvist,

Thank you very much for submitting your manuscript "Experimental and computational investigation of enzyme functional annotations reveals extensive annotation error" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

While I ask that you respond carefully to comments made by all reviewers, I would like you to pay particular attention to criticism and suggestions from reviewers 1 and 4 who call for a more in depth and balanced discussion of existing in silico annotation, within and outside of the BRENDA database and, in particular, for a more careful consideration of the different sources/criteria of/for these annotations and thus of their different level of expected reliability.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Marco Punta
Associate Editor
PLOS Computational Biology

Arne Elofsson
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1:

BACKGROUND

This paper addresses an important question that concerns many scientists who have the responsibility of maintaining databases of annotated proteins. How can a fair and complete description of the function of all of the proteins in the database be provided when so few of the proteins have been experimentally characterised?

GENERAL COMMENTS

The paper includes a section on the synthetic construction of a number of genes for uncharacterised proteins reported in the BRENDA database to have activity as EC 1.1.3.15. I found this section well argued and the authors showed an awareness of the limitations of the experimental techniques they were using.
The other half of the paper discusses what our response should be, given that the experimental data gathered by the authors shows that many of the uncharacterised proteins that they examined had a different activity from the one annotated in the BRENDA database. I have several misgivings about the line of argument followed in this section. I found the description of the process of experimental characterisation of proteins, and the consequent opportunity to propagate annotation to uncharacterised proteins lacked an understanding of the true state of play currently in the annotation of proteins. There was trenchant criticism of the annotations in one database (BRENDA) leading to a suggestion that all automated annotation of protein function is essentially flawed. This is not a correct assessment of the current state of play, and undermines the cooperative relationship that exists between those who provide experimental evidence of protein function, and those who use bioinformatic approaches to propagate this information to uncharacterised proteins. The authors seem unaware of the wider discussion about how to propagate annotation by sequence similarity. See, for instance, Pearson WR, Protein Function Prediction: Problems and Pitfalls. Curr. Protoc. Bioinform. 51:4.12.1-4.12.8. doi: 10.1002/0471250953.bi0412s51. The authors do compare their data with domain annotation provided by Pfam, but there is no mention of InterPro or the InterPro member databases (including Pfam) which try to generate models that capture protein families and not just domains. For instance, the HAMAP signature MF_00990 (https://hamap.expasy.org/rule/MF_00990) which is associated with EC 1.1.5.13 and which occurs in 210 of the 1414 records found in the supplementary file 1_1_3_15_BRENDA_sequences_filtered_2017_1.fasta. 
Agreed, there is only one characterised entry for MF_00990, but this family signature shows there are alternative annotations available for a good proportion of the BRENDA annotated EC 1.1.3.15 proteins. Also the Functional Families of the CATH database have representatives which touch the records the authors are looking at. Notably CATH 3.20.20.70, functional family 63 (http://www.cathdb.info/version/v4_3_0/superfamily/3.20.20.70/funfam/63/ec). This functional family includes 10 UniProt reviewed entries with EC 1.1.3.15 (all plant species) and 202 UniProt unreviewed entries which are also all plants. So there are routes to EC 1.1.3.15 prediction which respect the taxonomic imbalance of the experimentally characterised proteins. Building on from InterPro member databases that generate family signatures, there is the annotation effort of UniProt, which includes both curated rules (UniRule) and a fully automated prediction system (ARBA). The significance of this work is that it does not attempt to annotate everything, and limits itself to propagating annotation where there is consistent annotation of the characterised proteins, and a suitable sequence model that can be used to identify proteins of similar function. Out of the 1411 proteins listed in the Supplementary Material file 1_1_3_15_BRENDA_sequences_filtered_2017_1.fasta, and which map to entries in UniProt release 2021_02, UniRule annotates 12 plant proteins with EC 1.1.3.15 through ARBA00013087 (https://www.uniprot.org/arba/ARBA00013087), and 210 proteins from the Enterobacterales with EC 1.1.5.13 through MF_00990 (https://www.uniprot.org/unirule/UR000165710). It is true that BRENDA follows a policy of providing maximum coverage of annotation, as is made clear in the protocol for gathering annotation shown in Fig 1 of Quester and Schomburg BMC Bioinformatics 2011, 12:376 (http://www.biomedcentral.com/1471-2105/12/376). 
Also, the TrEMBL section of UniProt uncritically reports enzyme activities provided during submission of genome sequences until these are overwritten by a prediction system. These are indeed policies that the authors can validly criticise on the basis of their data. However, the best current approaches to annotation in the major protein databases do not attempt to annotate everything, but use a mixture of protein family signatures and taxonomic constraints to provide the best annotation that the current experimental data can support. This annotation clearly disagrees with both the extent and content of the annotation provided by BRENDA, showing there is a problem, even without the experimental data the paper provides. Focusing only on the shortcomings of BRENDA annotation and ignoring these other annotation approaches results in an unbalanced and misleading discussion. There is also a missed opportunity here: to target the selection of genes to synthesise to include representatives which are touched by the family signatures mentioned above, and thereby assess the reliability of these signatures.

SPECIFIC REQUIREMENTS

For me to be able to recommend this paper for publication, the discussion of the significance of the experimental data for BRENDA and other protein databases needs a good deal of revision.

1. The authors should not claim, either in the title or in the text, that their paper demonstrates a general failure of enzyme functional annotation in protein databases. They should also not claim that this is the first time that a significant mismatch has been shown between database annotation and subsequent experimental data. This has previously been thoroughly covered in a paper the authors quote (Schnoes et al., PLoS Computational Biology, 11 Dec 2009, 5(12), DOI: 10.1371/journal.pcbi.1000605). The current paper is a continuation of this, focusing on BRENDA instead of KEGG.

2. The authors need to make it clear in the text that they understand that experimental evidence is for the foreseeable future going to remain in very short supply, and that there is a valid role for automated annotation of protein function. On page 24, paragraph 1, the authors slip into the error of suggesting that the only true answer to reliable annotation of proteins is for much greater prominence to be given to experimental evidence (and by implication, a much greater quantity of it). This is naive and shows a lack of appreciation of the size of the task of protein annotation, and the mutually reinforcing roles played by experimentation and prediction in maximising the number of well annotated proteins.

3. The authors need to show that they can distinguish the different qualities of annotation (arising from different approaches to automated annotation) that are found in protein databases, as described above.

4. The relationship between the authors' clustering method and multiple sequence alignment methods needs to be more clearly discussed. Curiously, the data that the authors could be using to point to a better way forward in protein annotation is in the paper itself. The authors state on Page 6 line 14 that they wish to avoid grouping proteins based on multiple sequence alignment and use an approach from protein engineering. Having done this, they then display the level of agreement between the clustering they have used and the presence of several different Pfam domains. For PF01070 and PF01266 the clustering the authors have used arrives at the same place that is provided by multiple sequence alignment and the building of sequence models. So there needs to be a better discussion of the two approaches to sequence clustering instead of the implied assertion that alignment-free comparisons are inherently superior.
As a side note, it seems clear from the method paper quoted by the authors (https://doi.org/10.1038/s41592-019-0598-1) that the clustering approach used is likely to be too computationally expensive to be implemented in the major protein databases. So it provides a useful way of checking the results achieved by methods based on multiple sequence alignment, but is not an approach that could replace them.

5. The length of the discussion on annotation in BRENDA over time should be significantly reduced, as this point has already been well made for KEGG previously in the paper by Schnoes et al.

Finally, I did appreciate the provision of a full set of supplementary data which was very helpful for carrying out this review.

PROOF READING

These are suggestions for the authors to consider. Some are in sections of the manuscript that I am asking to be extensively revised, and should not be taken to indicate that I agree with the text.

Page 3 line 1, Change to: 'utilization of functional gene diversity' (delete 'the')
Page 3 line 5, Change to: 'at least 78% of the sequences' (add 'of the')
Page 4 line 12, 'initiatives were undertaken' change to 'initiatives have been undertaken'
Page 4 line 12, Change to: 'bring together computational and' (delete 'the')
Page 5 line 2, 'estimated the annotation error between' change to 'estimated the annotation error to be between'
Page 5 line 3, 'depending on a protein' change to 'depending on the protein'
Page 5 line 14, 'BRENDA DB' change to 'the BRENDA database' (For consistency with how BRENDA is referenced elsewhere.)
Page 6 line 16, '17 of these' Most journals don't allow you to start a sentence with a number in digits.
Page 6 line 17, Change to: 'evidence at the protein level' (add 'the')
Page 7 line 15, Change to: 'identity to previously characterised' (delete 'the')
Page 8 line 3, Change to: 'Pfam domain architecture' (Delete 'Predicted'. I suggest this is redundant as Pfam is a prediction in its nature.)
Page 10 line 8, Change to: 'all the members of the FMN-dependent' (add 'the')
Page 10 line 10, 'Fig 2C)' (Missing opening bracket.)
Page 10 line 19, Change to: 'Amplex Red assay, the four' (Add a comma.)
Page 13 line 13, Change to: 'the Pfam [30] domains' (Delete 'predicted'. I suggest this is redundant as Pfam is a prediction in its nature.)
Page 15 line 1, Change 'proved to show' to 'showed'
Page 16 line 2, Change 'indicates' to 'indicating'
Page 16 line 3, Change to: 'marked with squares; for proteins' (A semicolon instead of a comma, or a new sentence.)
Page 16 line 12, Change to: 'Comparison of Pfam domains' (Delete 'predicted'. I suggest this is redundant as Pfam is a prediction in its nature.)
Page 16 Table 1, Three decimal places is very unlikely to be justified by the data. I suggest two.
Page 17 line 4, Change to: 'we compared Pfam domains' (Delete 'predicted'. I suggest this is redundant as Pfam is a prediction in its nature.)
Page 21 Table legend, Change 'BRENDA DB' to 'the BRENDA database' (For consistency with how BRENDA is referenced elsewhere.)
Page 22 line 3, Change to: 'In contrast to previous studies' (delete 'the')
Page 24 line 10, Change to: 'called on authors' (add 'on')
Page 24 last line, Change 'will be of much higher standards.' to 'will be of a much higher standard.'
Page 26 line 6, Change to: 'the source organism's name in the NCBI' (Add apostrophe and 'the'.)
Page 27 line 10-13, 'The Shannon ... each cluster.' The meaning of this sentence is unclear to me.
Page 28 line 2, Change to: 'carried out' (Add 'out'.)
Page 28 line 5, Change to: 'was expressed three times; a sequence' (A semicolon instead of a comma, or a new sentence.)
Page 28 line 6, Change to: 'The soluble fraction of' (Add 'The'.)
Page 28 line 8, Change to: 'activity screen and determination of kinetic parameters.' (Rephrased)
Page 28 line 10, Change to: 'To screen for S-2-hydroxyacid' (Delete 'the')
Page 28 line 16, Is HRP a permitted abbreviation or does horse radish peroxidase need to be given somewhere?
Page 28 line 17, Change to: 'volume was 20 ul, and the assay' (Add 'and')
Page 28 line 20, Change 'triplicates' to 'triplicate'
Page 28 line 21, Change to: 'Values for non-specific activity in the absence of substrate were subtracted from experimental measurements.' (Rephrased)
Page 29 line 4, '1 ul of purified' Most journals don't allow you to start a sentence with a number in digits.
Page 29 line 6, Change to: 'L-Lactate' (Capitalisation of word.)
Page 29 line 8, Change to: 'tested with the following' (Add 'the'.)
Page 29 line 9, Change to: '2-Hydroxyacid' (Capitalisation of word.)
Page 29 line 11, Change 'triplicates' to 'triplicate'
Page 29 line 11-13, Change 'in case of' to 'in the case of' (Four changes.)
Page 29 line 14, Change to: 'monitored in controls lacking substrate, and the values were subtracted from experimental measurements.' (Rephrased)
Page 29 line 18, Change to: 'used for the assays were:' (Add 'were'.)
Page 29 line 22, 'Reaction rates ... electron acceptors' (This sentence is unclear to me.)
Page 30 line 9, Change to: 'Within each EC class, sequences' (Add comma.)
Page 30 line 17, Change to: 'highest k-tuple-based similarity, pairwise sequence' (Add comma.)

Reviewer #2: This article addresses a very important question which, despite its ancient origin, remains timely. While the topic of pervasive annotation errors is very general, the authors chose to illustrate the situation with a specific case, that of S-2-hydroxyacid oxidases. This work is therefore of interest for metabolic engineering studies. However, a biological justification of this choice would have improved readers' interest in this article.
Overall, the work provides an interesting and well documented study on the general problem posed by percolation of wrong annotations in public open databases. While the authors rightly point out the dangerous situation we are facing, this is not novel knowledge. As a matter of fact, over the years, several works focused on misannotations. This should probably be further emphasized in the introduction of the article as this could help readers to uncover other, highly relevant, related approaches. It would benefit the work to provide the readers with references to earlier attempts to tackle the question, in particular PMID 12490449, for example. Reference to works such as PMID 17708678, 28525546 in complement to 29806194, that was cited by the authors, would also help readers to understand how attempts were made to remedy this situation, with not much success, unfortunately. As the authors remarked, there is a need to couple in silico analyses with predictions and experimental attempts to validate the predictions. Again, a biological justification of their choice would be welcome. As a matter of fact, in a way quite similar to that proposed by the authors in the present paper, Risler and co-workers, twenty years ago, developed a work that associated an original in silico approach with experiments meant to identify explicitly the enzyme activities predicted in their work. This bioinformatics/experimental work focused on the differences between arginases and agmatinases. It validated experimentally the predictions. This early attempt seems highly reminiscent of the present one (PMID 10931887). 
This early work pointed out that it can be expected that methods should differ when looking into activities that correspond to sequences that diverged recently or slowly (implying that amino acid changes are relevant, such as those involved in catalysis, PMID: 31733177), or into sequences that diverged a very long time ago or rapidly (such as in virus evolution), where it is likely that only the global 3D structure is conserved, with only catalytic residues preserved (implying that 3D features found in insertions/deletions would be relevant). In this case, constructing phylogenies based on indel trees might be relevant. This feature has been recently used to characterize relevant traits of the SARS-CoV-2 descent, for example (PMID 33125064). The very specific case of 2-hydroxyglutarate oxidation, which is so important in a variety of regulatory or metabolic contexts, would probably benefit from this approach in another work. This observation makes this reviewer regret that there is so little biologically relevant information discussed in this paper. After all, annotation is meant to help investigators to progress in their understanding of biological functions, and some comments about the consequences of wrong annotations associated with the class of enzymes studied in the present work would have been more than welcome (and would have increased considerably the visibility of the work).

Reviewer #3: In the manuscript by Rembeza and Engqvist the authors assess 122 representative sequences of those annotated using E.C. class as S-2-hydroxyacid oxidases and find, by inference to related sequences, that 78% of the class is misannotated. The extension of the analysis computationally using the BRENDA database shows a high percentage of misannotation within a class to enzymes sharing no similarity or domain architecture. These findings will be useful to the general community and those specializing in the areas of enzyme structure and function.
The work is technically well performed and well presented. The authors may wish to consider the following points (page numbers from PDF for review and in order of appearance).

Page 8: The global initiatives cited should probably include the Enzyme Function Initiative summarized in Biochemistry. 2011 Nov 22;50(46):9950-62. doi: 10.1021/bi201312u

Page 9: EC number is assigned by activity, not by sequence. Here EC 1.1.3.15 was chosen for study, but the first section in Results immediately points out the dissimilarity in sequence and fold. However, this might actually be expected when starting from EC rather than from the CATH or SCOP database. To make the work more accessible to the general reader, the authors should include, at the beginning of the results or in the introduction, a brief description of the EC classification system and point out that the structure and sequence would not necessarily be expected to be the same within an EC class, but that the function should be the same. However, enzymes annotated using sequence identity may have a similar fold at best, but may indeed have different functions.

Page 11: The authors state “Most sequences have little similarity with the characterised ones; 79% of sequences annotated as 1.1.3.15 share less than 25% sequence identity with the closest biochemically characterized sequence (Fig. 1B, Fig. S3). Furthermore, only 22.5% of the 1058 sequences are predicted to contain the FMN-dependent dehydrogenase domain (FMN_dh, PF01070) which is canonical for known 2-hydroxy acid oxidases (Fig. 1C).” Can the authors posit at all how the annotations were originally made and what the mis-step was in that assignment? Was it merely using too low a threshold for sequence identity and then, as the authors later find, misannotations from old database versions perpetuated to newly added homologous sequences? Among the 24 proteins with the FMN-dh domain, for the proteins that were inactive, were studies performed to ensure that they were folded (i.e. by CD or light scattering)? If not, perhaps it should be described that the proteins were either misfolded or inactive.

Table 1: please give Km in M, not mM. Vmax is not useful; instead please give kcat/Km in M^-1 s^-1 so the reader does not need to calculate it.

Page 18: The authors state “Three of the four enzymes (D4MUV9, A0A077SBA9, S2DJ52) had substrate affinities in the micromolar range and high catalytic rates, strengthening the possibility that these may be the natural substrates.” As noted for Table 1, the column with Vmax should be replaced with kcat/Km, which should be used to discuss enzyme efficiency. I would say the values here should not be described as high (the greatest being 7.8 x 10^4 M^-1 s^-1), but they do approach the value typically used as a cutoff for a physiologically relevant substrate, ~1 x 10^5 M^-1 s^-1 (reference 4 in manuscript).

Minor changes/typos

Page 10: The authors state “17 of these sequences are characterised enzymes: either listed in BRENDA [17] as experimentally tested or in SwissProt [1] as having experimental evidence at protein level.” Do not start a sentence with a number.
Page 13: “…glycolate, lactate, 2-hydroxyoctanoate, 2-hydroxydecanoate, mandelate, 2-hydroxyglutarate” should read “... mandelate, and 2-hydroxyglutarate”
Page 14: “Indeed, the B8MKR3 protein displayed the cytochrome b2 L-lactate dehydrogenase activity” should read “... protein displayed cytochrome b2 L-lactate dehydrogenase activity”
Page 26: “In the work by Schnoes et al., based on entries to public databases in 2006, only 3 % of all sequences were considered misannotated due to the lack of similarity to the golden standard of a superfamily, in our study we show that this number is likely much higher now.” This is an awkward sentence although the meaning is clear. First, it is the gold, not golden, standard. Second, this should be broken into two sentences.
Page 28: “Only one article postulated for annotation transfer”. I think this should read “Only one article proposed an annotation transfer”
Page 28: “high-throughput experiments should also be developed, as though the depth of protein characterisation in such approaches is limited” should read “...also be developed, and though the depth of protein...”

Reviewer #4: The authors present a very careful analysis of the functional assignments of enzymes as available from public databases. Taking S-2-hydroxyacid oxidases (EC 1.1.3.15) as an example, they analyzed the sequences of more than 1000 proteins with a predicted activity of this class. Only 17 examples of these had an experimental characterization, 14 of them of eukaryotic origin. They found that almost 80% of the sequences assigned to this EC number have less than 25% sequence identity to the closest experimentally proven one, and only 22.5% are predicted to contain the FMN-dependent dehydrogenase domain. In fact, five different Pfam domains were found among the sequences. They took 122 sequences to try an experimental check of their enzyme activity. Out of these, 65% could be expressed in a soluble form and experimentally tested for the S-2-hydroxy acid oxidase activity with six different substrates. The expressed proteins containing the FMN_dh domain were checked for the enzyme activity, only partially with success. Out of the 41 expressed proteins that do not have the FMN_dh domain, most did not show an enzyme activity corresponding to the EC number 1.1.3.15; some of them displayed a related dehydrogenase activity, e.g. D-lactate dehydrogenase activity. Again, some of them were further purified for the determination of kinetic constants. All in all, the authors came to the result that those proteins that do not contain the canonical FMN-dh domain probably have other catalytic activities, and their annotation as S-2-hydroxyacid oxidases (EC 1.1.3.15) is probably not correct.
The authors have found that these probably misannotated proteins represent almost 80% of the sequences downloadable from Uniprot or BRENDA. In a final chapter the authors apply the results from their specific analysis to all annotated enzyme sequences downloadable from UNIPROT/BRENDA and find that the large majority of sequences show a sequence identity with the closest experimentally characterized representative of their EC-Class of more than 30%. The authors point out that, on the other hand, 20% of the sequences share less than 25% pairwise sequence identity with the closest characterized enzyme in their own EC-class.

Overall this is a very good and important paper, combining theoretical analyses with a large number of experimental data. In general it is well written. Nevertheless, there are a number of misunderstandings/errors in interpretation that have to be corrected before publication. In particular, the general conclusions in the discussion are not all justified. The following modifications and clarifications are absolutely essential:

1) As becomes obvious from the BRENDA publications, the function assignment for the downloadable sequences is directly imported from UNIPROT. UNIPROT has two datasets: the SWISSPROT sequences (presently 23 sequences for 1.1.3.15), and the TREMBL dataset (presently 3590 sequences). On the interactive UNIPROT pages they are named “Reviewed” or “Unreviewed”, and in the BRENDA downloadable data their source is given as SWISSPROT or TREMBL. Whereas the SWISSPROT data are manually checked by the UNIPROT scientists and should be highly reliable, the source of the functional assignment of TREMBL sequences is described in the SWISSPROT/TREMBL guide by the following: “Automated annotation of the highest currently available quality is integrated to TrEMBL entries.” Only for those enzymes where experimental data are available in BRENDA were the EC numbers manually checked and sometimes corrected by the BRENDA team.
For the EC class 1.1.3.15 these are presently 15 sequences. So, these annotation errors appear as errors in the BRENDA database but in fact are really errors in UNIPROT. This should be obvious because the source of the sequence/annotation is given in the downloadable file.

2) The label “Evidence on protein level” for UNIPROT sequences does not mean that there is experimental evidence for the function but only for the existence of a protein. Again, the UNIPROT guide says: “The value 'Experimental evidence at protein level' indicates that there is clear experimental evidence for the existence of the protein. The criteria include partial or complete Edman sequencing, clear identification by mass spectrometry, X-ray or NMR structure, good quality protein-protein interaction or detection of the protein by antibodies.” For the description of the protein function there are the so-called evidence codes in UNIPROT. One can find an explanation of the different codes here: http://www.uniprot.org/help/evidences. In short, 255 is added by automatic procedures, 250 is by similarity, and 269 is experimental evidence.

3) Looking at Figure 1 B one really does not get the impression that 80% of the sequences have less than 25% identity to the closest experimentally tested one. One would guess from the figure that this is only true for about 20%! This must be checked.

4) In the general discussion the authors say “Strikingly, in each of the superkingdoms almost one fifth of sequence share less than 25% pairwise identity with the closes characterized enzyme.” This statement as it stands gives the impression that this is unexpected, which is not really true. An EC-class is assigned to a protein based on its enzymatic function, i.e. the catalyzed reactions and their substrate specificity if there are clear differences observed. For example, for EC 1.1.3.15 the IUBMB description says: “A flavoprotein (FMN). Exists as two major isoenzymes; the A form preferentially oxidizes short-chain aliphatic hydroxy acids, and was previously listed as EC 1.1.3.1, glycolate oxidase; the B form preferentially oxidizes long-chain and aromatic hydroxy acids. The rat isoenzyme B also acts as EC 1.4.3.2, L-amino-acid oxidase.” It could be, and this is indeed very often the case, that there are several sequence families that have the same EC number and enzymatic function. So, these mentioned assignments could still be true, even if in UNIPROT and BRENDA no experimental data for these low-identity proteins are mentioned. Due to the limited manpower in UNIPROT and BRENDA only a small subset of papers can be annotated.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No
Reviewer #4: Yes: Dietmar Schomburg

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

18 Aug 2021

Submitted filename: reviewer_response.pdf

13 Sep 2021

Dear Dr. Engqvist,

We are pleased to inform you that your manuscript 'Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,
Marco Punta
Associate Editor
PLOS Computational Biology

Arne Elofsson
Deputy Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed the issues that I raised in my earlier review, and have made changes to the manuscript that I think are appropriate. I thank them for the careful attention they gave to the points I made.

Reviewer #2: The authors took into account essentially all my comments.
No further questions.

Reviewer #3: The authors have done a thorough job addressing the concerns raised in my review. It is important for the general reader that there is a better approach to presenting the expectations when examining enzymes with the same E.C. number.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes
Reviewer #2: None
Reviewer #3: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

17 Sep 2021

PCOMPBIOL-D-21-00508R1
Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class

Dear Dr Engqvist,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,
Amy Kiss
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
