Marc Röttig, Marnix H Medema, Kai Blin, Tilmann Weber, Christian Rausch, Oliver Kohlbacher.
Abstract
The products of many bacterial non-ribosomal peptide synthetases (NRPS) are highly important secondary metabolites, including vancomycin and other antibiotics. Predicting the substrate specificity of NRPS adenylation (A-) domains newly detected by genome sequencing is of great importance for identifying and annotating new gene clusters that produce secondary metabolites. Prediction of A-domain specificity based on the sequence alone can be achieved through sequence signatures or, more accurately, through machine learning methods. We present an improved predictor, based on previous work (NRPSpredictor), that predicts A-domain specificity using Support Vector Machines on four hierarchical levels, ranging from gross physicochemical properties of an A-domain's substrates down to single amino acid substrates. The three more general levels are predicted with an F-measure better than 0.89 and the most detailed level with an average F-measure of 0.80. We also modeled the applicability domain of our predictor, so that for a new A-domain we can estimate whether it lies within that domain. Finally, since NRPS also play an important role in the natural product chemistry of fungi, producing compounds such as peptaibols and cephalosporins, we added a predictor for fungal A-domains, which predicts gross physicochemical properties with an F-measure of 0.84. The service is available at http://nrps.informatik.uni-tuebingen.de/.
Year: 2011 PMID: 21558170 PMCID: PMC3125756 DOI: 10.1093/nar/gkr323
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
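The prediction levels named in the abstract run from coarse to fine. As a minimal sketch, the Python below looks up the class label of a substrate at the three grouped levels, using the class memberships listed in the table of this record; the fourth level is the single substrate itself. This is only a label lookup, not the SVM predictor, and the names `LEVELS` and `labels_for` are hypothetical.

```python
# Class memberships copied from the table in this record (bacterial models).
# LEVELS and labels_for are hypothetical names used only for illustration.
LEVELS = {
    "three class": {
        "Hydrophobic aliphatic": "Ala Gly Val Leu Ile Abu Iva Ser Thr Hpg Dhpg Cys Pro Pip",
        "Hydrophilic": "Arg Asp Glu His Asn Lys Gln Orn Aad",
        "Hydrophobic aromatic": "Phe Tyr Trp Dhb Phg Bht",
    },
    "large clusters": {
        "Hydroxy-benzoic acid derivatives": "Dhb Sal",
        "Polar, uncharged (aliphatic with -SH)": "Cys",
        "Aliphatic chain or phenyl group with -OH": "Ser Thr Dhpg Hpg",
        "Aliphatic chain with H-bond donor": "Asp Asn Glu Gln Aad",
        "Apolar, aliphatic": "Gly Ala Val Leu Ile Abu Iva",
        "Aromatic side chain": "Phe Trp Phg Tyr Bht",
        "Cyclic aliphatic chain (polar NH2 group)": "Pro Pip",
        "Long positively charged side chain": "Orn Lys Arg",
    },
    "small clusters": {
        "2-amino-adipic acid": "Aad",
        "Dhb, Sal": "Dhb Sal",
        "Polar, uncharged (hydroxy-phenyl)": "Dhpg Hpg",
        "Cys": "Cys",
        "Serine-specific": "Ser",
        "Threonine-specific": "Thr",
        "Asp-Asn": "Asp Asn",
        "Orn and hydroxy-Orn specific": "Orn",
        "Aliphatic, branched hydrophobic": "Val Leu Ile Abu Iva",
        "Tiny, hydrophilic, transition to aliphatic": "Gly Ala",
        "Pro-specific": "Pro",
        "Polar aromatic ring": "Tyr Bht",
        "Glu-Gln": "Glu Gln",
        "Arg-specific": "Arg",
        "Unpolar aromatic ring": "Phe Trp",
    },
}


def labels_for(substrate):
    """Return the class label of `substrate` at each level (None if unlisted)."""
    result = {}
    for level, classes in LEVELS.items():
        result[level] = next(
            (name for name, members in classes.items()
             if substrate in members.split()),
            None,
        )
    return result


print(labels_for("Ser"))
```

Some substrates have no class at a given level; Lys, for example, appears in a large cluster but in none of the small clusters, so the lookup returns `None` there.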
Prediction levels and predictor quality (bacterial)
| Class name | Members | Type | F (NRPSpredictor2) | Prec. (NRPSpredictor2) | Rec. (NRPSpredictor2) | F (NRPSpredictor1) |
|---|---|---|---|---|---|---|
| Three class | ||||||
| Hydrophobic aliphatic | Ala, Gly, Val, Leu, Ile, Abu, Iva, Ser, Thr, Hpg, Dhpg, Cys, Pro, Pip | W,R,T | 0.974 | 0.974 | 0.974 | – |
| Hydrophilic | Arg, Asp, Glu, His, Asn, Lys, Gln, Orn, Aad | W,R,T | 0.940 | 0.940 | 0.940 | – |
| Hydrophobic aromatic | Phe, Tyr, Trp, Dhb, Phg, Bht | W,R,T | 0.890 | 0.889 | 0.892 | – |
| Large clusters | ||||||
| Hydroxy-benzoic acid derivatives | Dhb, Sal | W,R,T | 0.982 | 1.000 | 0.967 | 0.982 |
| Polar, uncharged (aliphatic with -SH) | Cys | R,R,T | 0.976 | 0.975 | 0.975 | 0.954 |
| Aliphatic chain or phenyl group with -OH | Ser, Thr, Dhpg, Hpg | R,R,T | 0.968 | 0.967 | 0.969 | 0.963 |
| Aliphatic chain with H-bond donor | Asp, Asn, Glu, Gln, Aad | W,R,C | 0.958 | 0.969 | 0.950 | 0.942 |
| Apolar, aliphatic | Gly, Ala, Val, Leu, Ile, Abu, Iva | W,R,T | 0.940 | 0.947 | 0.934 | 0.940 |
| Aromatic side chain | Phe, Trp, Phg, Tyr, Bht | W,R,T | 0.881 | 0.881 | 0.881 | 0.881 |
| Cyclic aliphatic chain (polar NH2 group) | Pro, Pip | R,R,T | 0.867 | 0.867 | 0.867 | 0.811 |
| Long positively charged side chain | Orn, Lys, Arg | W,R,T | 0.864 | 0.898 | 0.833 | 0.861 |
| Ø | | | 0.930 | – | – | 0.917 |
| Small clusters | ||||||
| 2-amino-adipic acid | Aad | W,L,C | 1.000 | 1.000 | 1.000 | 1.000 |
| Dhb, Sal | Dhb, Sal | W,L,C | 1.000 | 1.000 | 1.000 | 0.940 |
| Polar, uncharged (hydroxy-phenyl) | Dhpg, Hpg | R,L,T | 1.000 | 1.000 | 1.000 | 0.981 |
| Cys | Cys | R,L,T | 0.983 | 0.983 | 0.983 | 0.950 |
| Serine-specific | Ser | W,R,T | 0.972 | 1.000 | 0.947 | 0.936 |
| Threonine-specific | Thr | W,L,C | 0.969 | 0.978 | 0.961 | 0.942 |
| Asp-Asn | Asp, Asn | W,L,C | 0.948 | 0.969 | 0.931 | 0.942 |
| Orn and hydroxy-Orn specific | Orn | R,L,T | 0.900 | 0.900 | 0.900 | 0.800 |
| Aliphatic, branched hydrophobic | Val, Leu, Ile, Abu, Iva | W,R,T | 0.893 | 0.892 | 0.895 | 0.887 |
| Tiny, hydrophilic, transition to aliphatic | Gly, Ala | W,L,C | 0.886 | 0.938 | 0.843 | 0.859 |
| Pro-specific | Pro | R,L,T | 0.882 | 0.938 | 0.833 | 0.900 |
| Polar aromatic ring | Tyr, Bht | W,R,T | 0.857 | 0.892 | 0.825 | 0.793 |
| Glu-Gln | Glu, Gln | W,L,C | 0.813 | 0.850 | 0.791 | 0.860 |
| Arg-specific | Arg | W,L,C | 0.740 | 1.000 | 0.600 | 0.800 |
| Unpolar aromatic ring | Phe, Trp | W,L,C | 0.538 | 0.608 | 0.500 | 0.671 |
| Ø | | | 0.892 | – | – | 0.884 |
| Single substrates | ||||||
| Aad | Aad | W,R,T | 1.000 | 1.000 | 1.000 | – |
| Cys | Cys | R,R,T | 1.000 | 1.000 | 1.000 | – |
| Hpg | Hpg | R,R,T | 0.974 | 1.000 | 0.950 | – |
| Ser | Ser | W,R,T | 0.962 | 0.993 | 0.933 | – |
| Thr | Thr | W,R,T | 0.949 | 0.976 | 0.922 | – |
| Dhb | Dhb | W,R,T | 0.947 | 1.000 | 0.900 | – |
| Dhpg | Dhpg | W,R,T | 0.943 | 0.967 | 0.925 | – |
| Asn | Asn | R,R,T | 0.939 | 0.934 | 0.944 | – |
| Orn | Orn | R,R,T | 0.933 | 0.933 | 0.933 | – |
| Ile | Ile | R,R,T | 0.918 | 1.000 | 0.850 | – |
| Gly | Gly | R,R,T | 0.906 | 0.902 | 0.910 | – |
| Ala | Ala | W,R,T | 0.878 | 0.901 | 0.856 | – |
| Arg | Arg | W,R,T | 0.833 | 0.833 | 0.833 | – |
| Iva | Iva | W,R,T | 0.814 | 0.933 | 0.725 | – |
| Val | Val | W,R,T | 0.801 | 0.828 | 0.777 | – |
| Leu | Leu | W,R,T | 0.784 | 0.782 | 0.787 | – |
| Pro | Pro | W,R,T | 0.755 | 0.792 | 0.722 | – |
| Bht | Bht | W,R,T | 0.717 | 0.782 | 0.675 | – |
| Glu | Glu | R,R,T | 0.704 | 0.760 | 0.657 | – |
| Pip | Pip | W,R,T | 0.700 | 0.800 | 0.625 | – |
| Asp | Asp | R,R,T | 0.700 | 0.700 | 0.700 | – |
| Tyr | Tyr | W,R,T | 0.696 | 0.671 | 0.725 | – |
| Gln | Gln | W,R,T | 0.689 | 0.775 | 0.620 | – |
| Phe | Phe | W,R,T | 0.688 | 0.740 | 0.643 | – |
| Lys | Lys | R,R,T | 0.400 | 0.500 | 0.333 | – |
| Trp | Trp | W,R,T | 0.320 | 0.400 | 0.267 | – |
The Type column gives the best-performing predictor, encoded by three letters: the first letter represents the encoding used (W: Wold, R: Rausch), the second the kernel (L: linear, R: RBF) and the third the SVM type (C: classical SVM, T: transductive SVM). The columns F, Prec. and Rec. give the F-measure, precision and recall of the best predictor, respectively. Aad: 2-amino-adipic acid; Bht: beta-hydroxy-tyrosine; Hpg: 4-hydroxy-phenyl-glycine; Dhb: 2,3-dihydroxy-benzoic acid; Dhpg: 3,5-dihydroxy-phenyl-glycine; Iva: isovaline; Orn: ornithine; Pip: pipecolic acid; Sal: salicylic acid.
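The three-letter codes and the F-measure described in this note can be read off mechanically. The following is a small sketch, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which this record does not state explicitly; `decode_type` and `f_measure` are hypothetical helper names.

```python
# Letter meanings taken from the table note; helper names are hypothetical.
ENCODING = {"W": "Wold", "R": "Rausch"}
KERNEL = {"L": "linear", "R": "RBF"}
SVM_TYPE = {"C": "classical SVM", "T": "transductive SVM"}


def decode_type(code):
    """Expand a code such as 'W,R,T' into encoding, kernel and SVM type."""
    enc, kern, svm = (part.strip() for part in code.split(","))
    return {
        "encoding": ENCODING[enc],
        "kernel": KERNEL[kern],
        "svm": SVM_TYPE[svm],
    }


def f_measure(precision, recall):
    """Balanced F1 score (assumed definition): harmonic mean of P and R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(decode_type("W,R,T"))
# Threonine-specific row of the table: Prec. 0.978, Rec. 0.961
print(round(f_measure(0.978, 0.961), 3))
```

Values computed this way are close to the tabulated F values but not always identical to them (for example, the Arg-specific row lists F = 0.740 for precision 1.000 and recall 0.600); the record does not say how the reported figures were averaged.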
Figure 1. NRPSpredictor2 prediction report for one extracted A-domain. On top, the ID of the parent sequence, location of the A-domain within the sequence and the bit score of the PFAM-HMM are given. The green checkmark signals that the signature sequence lies within the applicability domain of the model. The extracted 8 Å signature and Stachelhaus code are given directly below. Subsequently, the list of predictions is given along with the score of the respective SVM predictors. For each predictor we also report the reliability of that predictor as determined during model validation. The last row gives the nearest sequence neighbor in the NRPSpredictor2 database (based on Stachelhaus code) and the respective sequence identity.
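The nearest-neighbor row of the report can be illustrated with a short sketch. It assumes that sequence identity is simply the fraction of matching positions between two equal-length Stachelhaus codes, which this record does not confirm. The function names, the database entries and their labels are illustrative placeholders, not verified signatures from the NRPSpredictor2 database.

```python
def identity(code_a, code_b):
    """Fraction of identical positions between two equal-length codes."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    matches = sum(a == b for a, b in zip(code_a, code_b))
    return matches / len(code_a)


def nearest_neighbor(query, database):
    """Return (code, label, identity) of the database entry closest to query."""
    code, label = max(database.items(),
                      key=lambda item: identity(query, item[0]))
    return code, label, identity(query, code)


# Placeholder 10-letter codes and labels, for illustration only.
toy_database = {
    "DAWTIAAICK": "entry_A",
    "DVWHLSLIDK": "entry_B",
    "DLTKVGHIGK": "entry_C",
}

code, label, ident = nearest_neighbor("DAWTIAAVCK", toy_database)
print(code, label, f"{ident:.0%}")
```

A real implementation might align codes or weight positions differently; the plain position-wise comparison is chosen here only because the codes have a fixed length.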