Barzan I Khayatt, Lex Overmars, Roland J Siezen, Christof Francke.
Abstract
There is a growing interest in the non-ribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs) of microbes, fungi and plants because they can produce bioactive peptides such as antibiotics. The ability to identify the substrate specificity of the enzymes' adenylation (A) and acyl-transferase (AT) domains is essential to rationally deduce or engineer new products. Here we report on a Hidden Markov Model (HMM)-based ensemble method that predicts substrate specificity with high quality. We collected a new reference set of experimentally validated sequences. An initial classification based on alignment and Neighbor Joining was performed, in line with most of the previously published prediction methods. We then created and tested single substrate-specific HMMs and found that their use significantly improved the correct identification for A as well as for AT domains. A major advantage of HMMs is that they remove the dependency on multiple sequence alignment and residue selection that hampers the alignment-based clustering methods. Using our models we obtained a high prediction quality for the substrate specificity of the A domains, similar to that of two recently published tools that make use of HMMs or Support Vector Machines (NRPSsp and NRPSpredictor2, respectively). Moreover, replacing the single substrate-specific HMMs by ensembles of models caused a clear increase in prediction quality. We argue that the superiority of the ensemble over the single model is caused by the way substrate specificity evolves for the studied systems. It is likely that this also holds true for other protein domains. The ensemble predictor has been implemented in a simple web-based tool that is available at http://www.cmbi.ru.nl/NRPS-PKS-substrate-predictor/.
Year: 2013 PMID: 23637983 PMCID: PMC3630128 DOI: 10.1371/journal.pone.0062136
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
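The ensemble idea described in the abstract can be illustrated with a minimal sketch. It assumes that each substrate is represented by several HMMs, that every model has already produced an E-value-like score for the query domain (lower is better), and that a query whose best score lies above a cutoff receives no annotation. The function name `predict_substrate`, the score convention and the cutoff value are illustrative assumptions, not details taken from the published tool.

```python
from typing import Dict, List, Optional, Tuple


def predict_substrate(
    scores: Dict[str, List[float]],
    threshold: float,
) -> Tuple[Optional[str], Optional[float]]:
    """Pick the substrate whose ensemble contains the best-scoring model.

    `scores` maps a substrate name to the scores of all HMMs in that
    substrate's ensemble (assumption: E-value-like, lower is better).
    Returns (substrate, best_score); the substrate is None when the best
    score is above `threshold`, i.e. the query stays unannotated.
    """
    best_substrate: Optional[str] = None
    best_score: Optional[float] = None
    for substrate, model_scores in scores.items():
        if not model_scores:
            continue  # substrate without any model: nothing to compare
        candidate = min(model_scores)  # best model within this ensemble
        if best_score is None or candidate < best_score:
            best_substrate, best_score = substrate, candidate
    if best_score is None or best_score > threshold:
        return None, best_score  # scored "above threshold"
    return best_substrate, best_score


# Hypothetical scores for one AT domain query (made-up numbers).
query_scores = {"MC": [1e-40, 1e-55], "MMC": [1e-30], "EMC": [1e-10, 1e-12]}
print(predict_substrate(query_scores, threshold=1e-20))
```

A single-HMM predictor is the special case in which every list holds exactly one score, so the same function can be used to compare the two set-ups on identical input.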
Figure 1. Frequency representations of conserved residues in the AT- and A-domain.
A) The active site residues extracted for the AT domain and B) the 10 core motifs within the A domain. The representations were made using Weblogo [52] on the basis of the multiple sequence alignment of all domains in the collected dataset and the 13 active site residues identified by [24] (i.e. 11, 63, 90–94, 117, 200, 201, 231, 250, 255) for the AT domain and the 10 core motifs identified by [54] for the A domain.
AT domain classification on basis of the NJ-algorithm for various selected sets of residues.
| AT Domain Substrate$ | Complete this study | 13 residues Serre et al. | 23 residues Yadav et al. | 92 residues Minowa et al. | 165 or 146a* selected residues | 37b* selected residues |
| MC (92) | 1∼ | 0.96∼ | 0.95∼ | 0.98∼ | 0.90∼ | 1∼ |
| MMC (83) | 1∼ | 1∼ | 1∼ | 1∼ | 0.96∼ | 1∼ |
| 2MBuC (2) | nsc | 1 | 1 | nsc | 1 | 1 |
| IBuC (3) | 0.66 | 0.66 | 0.66 | 0.66 | 0.66 | 1 |
| PC (3) | 1 | 1 | 1 | 1 | 1 | 1 |
| MOMC (12) | nsc | nsc | nsc | nsc | nsc | 1 |
| EMC (12) | nsc | nsc | nsc | nsc | nsc | nsc |
The first column lists the different substrate groups and gives the number of represented sequences in brackets. The values in columns 3, 4 and 5 were calculated on the basis of the residues identified by [24], [25] and [35], as indicated. The two major substrate groups MC (malonyl-CoA) and MMC (methylmalonyl-CoA) were reasonably well distinguishable in all trees. However, the factual accuracy of the MC and MMC prediction is lower than 1, as all of the ‘minor’ substrate-specific AT sequences fall within both clusters. Abbreviation: nsc, not in a single cluster.
$ For substrate abbreviations see the legend of Figure 2. The initial complete dataset was used to compose the Table (i.e. including the near duplicate sequences), excluding the sequences related to BzC (2), 3MbuC (1), AC (1), CH (1), and CP (1).
a* 165 conserved positions (100% identity) in at least one of the substrate groups; 146 conserved positions in case the residues are removed that are conserved throughout all substrate groups; b* Conserved positions (100% identity) in at least three of the substrate groups (do not include global identical).
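The residue selection described in footnotes a* and b* can be sketched as follows. The sketch assumes that the sequences of each substrate group are already aligned to a common length, that a position counts as conserved when all sequences in the group share the same non-gap residue, and that positions are reported zero-based. The function names and the gap handling are assumptions made for illustration.

```python
from typing import Dict, List, Set


def conserved_columns(aligned: List[str]) -> Set[int]:
    """Zero-based columns where all sequences share one non-gap residue."""
    first = aligned[0]
    return {
        i
        for i in range(len(first))
        if first[i] != "-" and all(seq[i] == first[i] for seq in aligned)
    }


def select_positions(
    groups: Dict[str, List[str]],
    min_groups: int = 1,
    drop_global: bool = True,
) -> List[int]:
    """Columns that are 100% conserved within at least `min_groups` groups.

    With `drop_global=True`, columns that are identical across all
    sequences of all groups are removed, since they cannot discriminate.
    """
    counts: Dict[int, int] = {}
    for seqs in groups.values():
        for col in conserved_columns(seqs):
            counts[col] = counts.get(col, 0) + 1
    selected = {col for col, n in counts.items() if n >= min_groups}
    if drop_global:
        all_seqs = [s for seqs in groups.values() for s in seqs]
        selected -= conserved_columns(all_seqs)
    return sorted(selected)


# Toy alignment (made-up sequences, not taken from the dataset).
toy = {
    "MC": ["GHSQG", "GHSQA"],
    "MMC": ["GHSYG", "GHSYG"],
    "PC": ["GASVG", "GASVG"],
}
print(select_positions(toy, min_groups=1))  # analogous to selection a*
print(select_positions(toy, min_groups=3))  # analogous to selection b*
```

Raising `min_groups` trades the number of selected positions for positions that are informative across more substrate groups, which mirrors the difference between the larger a* selection and the smaller b* selection.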
Figure 2. Frequency representation of the active site residues within the AT domain per substrate.
The Malonyl-CoA (MC) specific AT domain can be separated from the rest on the basis of clearly distinct conserved residues (box A), as can the 2-Methylbutyryl-CoA (2MBuC), Benzoyl-CoA (BzC), Isobutyryl-CoA (IBuC) and Propionyl-CoA (PC) specific AT domains (box B). For the Methylmalonyl-CoA (MMC), Ethylmalonyl-CoA (EMC) and Methoxymalonyl-CoA (MOMC) specific AT domains, the conserved active site residues are almost indistinguishable (box C). The sequence representations were made using Weblogo [52].
Figure 3. Frequency representation of the active site residues within the A domain per substrate.
A) The A-domains were clustered according to common conserved residues, as indicated by black boxes (see e.g. [33]). B) The newly identified substrates have been placed on the basis of the motif. For proteinogenic amino acids the three-letter code was used. The non-proteinogenic amino acids are indicated by the following abbreviations: aad, 2-amino-adipic acid; abu, 2-amino-butyric acid; allo-thr, allo-threonine; B-ala, beta-alanine; bht, beta-hydroxy-tyrosine; B-lys, beta-lysine; bmt, (4R)-4[(E)-2-butenyl]-4-methyl-L-threonine; dab, 2,4-diamino-butyric acid; dhab, 2,3-dehydroaminobutyric acid; dhb, 2,3-dihydroxy-benzoic acid; dhpg = dpg, 3,5-dihydroxy-phenyl-glycine; dht, dehydro-threonine = dhbu = 2,3-dehydroaminobutyric acid; fN5horn, N5-hydroxyornithine; hpg, 4-hydroxy-phenyl-glycine; hpg2Cl, 3,5-dichloro-4-hydroxy-L-phenylglycine; iva, isovaline; masp, methyl-aspartate; mpro, methyl-proline; orn, ornithine; pheac, phenylacetate; pip, pipecolic acid; sal, salicylic acid; sar, sarcosine. The sequence representations were made using Weblogo [52].
AT domain classification on basis of HMMs.
| | single HMMs: c | single HMMs: f | single HMMs: at | ensemble of HMMs: c | ensemble of HMMs: f | ensemble of HMMs: at | ensemble of HMMs LOO: c | ensemble of HMMs LOO: f | ensemble of HMMs LOO: at |
| MC (69) | | 1 | | | 0 | | | 3 | |
| MMC (63) | | 0 | | | 0 | | | 1 | |
| 2MBuC (2) | | 0 | | | 0 | | | 2 | |
| BzC (2) | | 0 | | | 0 | | | 1 | |
| IBuC (3) | | 0 | | | 0 | | | 1 | |
| EMC (11) | | 2 | | | 0 | | | 7 | |
| MOMC (10) | | 0 | | | 1 | | | 4 | |
| PC (3) | | 0 | | | 0 | | | 0 | |
| Other (4)# | 0 | 3 | | 0 | 3 | | | | |
| | | 3.6 | | | 2.4 | | | 11.7 | |
The first column lists the different substrates and, in brackets, the number of sequences that were analyzed. The Table lists the number of correctly (c, bold) and falsely (f) classified sequences and the number of sequences that scored above threshold (at, grey and in italics). The values in columns 2, 3 and 4 were derived from the use of a single HMM per substrate, and columns 5, 6 and 7 relate to the prediction made using an ensemble of multiple HMMs per substrate. The values in columns 8, 9 and 10 relate to the Leave One Out cross validation.
$ The set contained 167 non redundant sequences. See the legend of Figure 2 for the systematic name of the various substrates.
# The category ‘other’ sequences includes those specific for 3MbuC, AC, CH and CP as only one sequence has been experimentally identified and thus no reliable model could be made.
Quality of A domain substrate specificity predictions using HMMs and SVMs.
| data$ | correct | false | Above threshold | coverage | Correct of covered | |
| NRPSsp | P∩K’ |
| 7 |
| ||
| K | (77) |
| ||||
| NRPSpredictor2 | R∩K’ |
| 8 |
| ||
| K | (79) |
| ||||
| single HMMs | K’ |
| 4 |
| ||
| K | (88) |
| ||||
| ensemble HMMs | P∩K’ |
| 1 |
| ||
| R∩K’ |
| 3 |
| |||
| K’ |
| 2 |
| |||
| K |
| 4 |
| (96) |
| |
| P |
| 3 |
| (88) |
| |
| LOO |
| 13 |
| (79) |
|
Substrate specificity predictions were made for various sequence data-sets using the published tools NRPSsp [42], NRPSpredictor 2 [43], and our single and ensemble of HMMs. Column 1 indicates the predictor that was tested and Column 2 the data that was used to test. Columns 3 and 4 provide the percentage of correct and false predictions below the set threshold, respectively, and column 5 the percentage of predictions that scored above threshold. Column 6 gives the fraction of sequences from the complete non-redundant data-set that received an annotation. Column 7 provides the fraction of correctly annotated sequences within the set of sequences that was provided with an annotation.
$ To test the coverage and check the validity of the predictions, the four predictors were applied to the non-redundant reference dataset of experimentally validated substrate specific A domain sequences collected by us from the reference databases, literature and from [43] (set K = 571 sequences). To compare the performance, the predictors were applied to those sequences that are shared between data-sets. We found 392 sequences to be shared between the data-set used to train NRPSsp [42] and our non-redundant set (P∩K’), and 405 sequences to be shared between the data-set used to train NRPSpredictor2 [43] and our non-redundant set (R∩K’). In this case, K’ indicates that the sequences related to a substrate for which no model was present in either of the predictors, were left out in the comparison. The ensemble of HMMs was also applied to the dataset provided by [42] (P). To test the sensitivity of the ensemble models with respect to the removal of constituent sequences a Leave One Out cross validation was performed (LOO).
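The two summary measures defined in the legend above (columns 6 and 7) can be written out as a short sketch. It assumes that every sequence falls into exactly one of three outcomes (correct, false, or above threshold) and that a sequence "received an annotation" when it was classified either correctly or falsely. The class name `Counts` and the example numbers are illustrative and are not values from the Table.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    correct: int          # annotated and matching the validated substrate
    false: int            # annotated but not matching
    above_threshold: int  # no annotation given


def coverage(c: Counts) -> float:
    """Fraction of all sequences that received an annotation."""
    total = c.correct + c.false + c.above_threshold
    return (c.correct + c.false) / total if total else 0.0


def correct_of_covered(c: Counts) -> float:
    """Fraction of annotated sequences whose annotation is correct."""
    covered = c.correct + c.false
    return c.correct / covered if covered else 0.0


# Made-up counts for illustration only.
example = Counts(correct=80, false=10, above_threshold=10)
print(f"coverage: {coverage(example):.2f}")
print(f"correct of covered: {correct_of_covered(example):.2f}")
```

Reporting both values side by side separates a predictor that annotates few sequences but does so reliably from one that annotates many with more errors.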
A domain classification with an ensemble of HMMs.
| | ensemble HMMs: c | ensemble HMMs: f | ensemble HMMs: at | LOO: c | LOO: f | LOO: at |
| aad (10) | | 0 | | | 0 | |
| abu, iva (17/12) | | 1 | | | 3 | |
| ala (46) | | 1 | | | 8 | |
| b-ala (4) # | | 0 | | | 0 | |
| arg (7) | | 0 | | | 1 | |
| asn (20) | | 0 | | | 0 | |
| asp (15) | | 0 | | | 0 | |
| bht (6) | | 0 | | | 1 | |
| bmt (2) | | 0 | | | 2 | |
| cys (27) | | 0 | | | 2 | |
| dab (10) # | | 0 | | | 0 | |
| dhab, dht (4) # | | 0 | | | 0 | |
| dhb, sal (12) | | 0 | | | 0 | |
| dhpg, dpg (8) | | 0 | | | 0 | |
| fN5H-orn (4) | | 0 | | | 0 | |
| gln (10) | | 0 | | | 3 | |
| glu (16) | | 0 | | | 3 | |
| gly (30) | | 1 | | | 5 | |
| his (2) | | 0 | | | 0 | |
| horn (3) # | | 0 | | | 1 | |
| hpg, hpg2Cl (21/15) | | 0 | | | 0 | |
| hyv-d (3) # | | 0 | | | 0 | |
| ile (13) | | 0 | | | 3 | |
| leu (41) | | 4 | | | 7 | |
| lys (8) | | 0 | | | 0 | |
| b-lys (3) | | 0 | | | 0 | |
| me-asp (4) | | 0 | | | 0 | |
| orn (12) | | 0 | | | 1 | |
| phe (15) | | 1 | | | 5 | |
| phe-ac (3) | | 0 | | | 0 | |
| pip (8) | | 0 | | | 2 | |
| pro, me-pro (20) | | 0 | | | 1 | |
| ser (33) | | 3 | | | 4 | |
| thr, allo-thr (34) | | 0 | | | 2 | |
| trp (14) | | 0 | | | 2 | |
| tyr (18) | | 0 | | | 6 | |
| val (34) | | 1 | | | 4 | |
| ambiguous (15) | | 4 | | | - | |
| other (19) | 0 | 4 | | | - | |
The first column lists the different substrates and the number of sequences analyzed (in brackets). The second column lists the number of sequences correctly classified by our ensemble of HMMs, for the non-redundant reference dataset of experimentally validated substrate-specific A domain sequences collected from reference databases, literature and from [43] (set K = 571 sequences). The third column gives the number of sequences that received a false annotation (f), and the fourth column gives the number of sequences that scored above threshold (at, grey and numbers in italics). Columns five, six and seven provide the same information for the Leave One Out cross validation.
$ See the legend of Figure 3 for the systematic name of the various substrates. The category ‘other’ includes those substrates that are represented only once in the domain sequence dataset. They include: 2-oxo-isovaleric-acid, 3-methyl-glutamate (3-me-glu), 4-propyl-proline (4ppro), 2-amino-9,10-epoxy-8-oxodecanoic acid (aeo), alaninol, alle, alpha-hydroxy-isocaproic acid, an, (S)-2-amino-8-oxodecanoic acid (aoda), l-capreomycidine (cap), d-lysergic acid (d-lyserg), hydroxyl-asn, hmp-D, LDAP, MeHOval, N-methyl-phenylalanine (mephe), N-methyl valine (meval), N-(1,1-dimethyl-1-allyl)tryptophan, phenyl-glycine (phg), s-nmethoxy-tryptophan, (4S)-5,5,5-trichloro-leucine (tcl), valinol (vol).
*, # and &: For particular substrates no representative models were present in one or more of the predictors that were compared in Table 3 (*, NRPSsp; #, NRPSpredictor2; &, ensemble HMMs). Ideally the related sequences should obtain a score above threshold.