Christian Rausch, Tilmann Weber, Oliver Kohlbacher, Wolfgang Wohlleben, Daniel H Huson.
Abstract
We present a new support vector machine (SVM)-based approach to predict the substrate specificity of subtypes of a given protein sequence family. We demonstrate the usefulness of this method on the example of aryl acid-activating and amino acid-activating adenylation domains (A domains) of nonribosomal peptide synthetases (NRPS). The residues of gramicidin synthetase A that are 8 Å around the substrate amino acid and corresponding positions of other adenylation domain sequences with 397 known and unknown specificities were extracted and used to encode this physico-chemical fingerprint into normalized real-valued feature vectors based on the physico-chemical properties of the amino acids. The SVM software package SVM(light) was used for training and classification, with transductive SVMs to take advantage of the information inherent in unlabeled data. Specificities for very similar substrates that frequently show cross-specificities were pooled to the so-called composite specificities and predictive models were built for them. The reliability of the models was confirmed in cross-validations and in comparison with a currently used sequence-comparison-based method. When comparing the predictions for 1230 NRPS A domains that are currently detectable in UniProt, the new method was able to give a specificity prediction in an additional 18% of the cases compared with the old method. For 70% of the sequences both methods agreed, for <6% they did not, mainly on low-confidence predictions by the existing method. None of the predictive methods could infer any specificity for 2.4% of the sequences, suggesting completely new types of specificity.
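As an illustration of the encoding step described in the abstract, the following is a minimal sketch of how a string of binding-pocket residues could be turned into a normalized real-valued feature vector. The abstract does not list which physico-chemical properties were used, so the two properties here (the Kyte-Doolittle hydropathy index and a simplified formal charge) are assumptions, as are the names `zscore_table`, `PROPERTIES` and `encode` and the toy residue string.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy index; chosen here only as an example property
# (assumption: the abstract does not name the properties actually used).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified formal side-chain charge at neutral pH (His treated as neutral).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})


def zscore_table(table):
    """Normalize a per-residue property to zero mean and unit variance."""
    mu = mean(table.values())
    sigma = pstdev(table.values())
    return {aa: (value - mu) / sigma for aa, value in table.items()}


PROPERTIES = [zscore_table(HYDROPATHY), zscore_table(CHARGE)]


def encode(signature):
    """Return a flat vector with len(PROPERTIES) values per residue position."""
    vector = []
    for residue in signature:
        for prop in PROPERTIES:
            # Gaps or unknown residues get 0.0, the mean of a z-scored property.
            vector.append(prop.get(residue, 0.0))
    return vector


# Hypothetical 6-position signature with one alignment gap.
features = encode("DAVFK-")
print(len(features))  # 6 positions x 2 properties
```

With the 34 positions mentioned in the abstract and two properties, the same function would produce a 68-dimensional vector per A domain, which could then be passed to any SVM implementation.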
Year: 2005 PMID: 16221976 PMCID: PMC1253831 DOI: 10.1093/nar/gki885
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Module and domain structure of NRPS. Above (in the middle): one complete NRPS consisting of three modules. Below: enzymatic domains that are contained in a complete module: Cond: condensation domain, Adenyl: adenylation domain (A domain), N-Meth: N-methylation domain (optional, does not appear in all NRPS), PCP: thiolation domain (Peptidyl Carrier Protein domain), Epi: epimerization domain (optional). Other optional domains are heterocyclation, oxidation, reduction and formylation domains. The substrate specificity of the adenylation domain is the subject of this study.
Figure 2. Phenylalanine bound to gramicidin synthetase A activation domain. (a) The 10 residues (green) that are in direct contact with the substrate phenylalanine (ball and stick representation) are shown. These 10 residues are the basis for the specificity prediction method by Stachelhaus et al. (2). (b) Same as in (a) but the residues are in the space filling representation. (c) The residues in green (at a distance of up to 5.5 Å from phenylalanine) are surrounded by all the 34 residues (purple) at a distance of up to 8 Å from phenylalanine. The predictive method described here is based on these 34 amino acids and encodes them by their physico-chemical properties. Representations were created using BALLView (17, ).
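The two shells in this figure (5.5 Å and 8 Å around the substrate) amount to a distance-cutoff selection. The sketch below shows one way such a selection could be computed from atomic coordinates. It assumes that a residue counts as selected when any of its atoms lies within the cutoff of any ligand atom, which is a common convention but is not stated in the caption; the function name `residues_within` and all coordinates are hypothetical toy values, not taken from the gramicidin synthetase A structure.

```python
import math


def residues_within(ligand_atoms, residue_atoms, cutoff):
    """Return sorted IDs of residues having at least one atom within
    `cutoff` angstroms of at least one ligand atom."""
    selected = []
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            selected.append(res_id)
    return sorted(selected)


# Toy coordinates in angstroms (hypothetical, for illustration only).
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residues = {
    "R1": [(3.0, 0.0, 0.0), (4.0, 1.0, 0.0)],  # 1.5 A from the ligand
    "R2": [(0.0, 7.0, 0.0)],                   # 7.0 A from the ligand
    "R3": [(0.0, 0.0, 12.0)],                  # 12.0 A from the ligand
}

inner_shell = residues_within(ligand, residues, 5.5)
outer_shell = residues_within(ligand, residues, 8.0)
print(inner_shell, outer_shell)
```

In this toy setup the inner shell contains only `R1`, while the outer shell adds `R2`, mirroring how the 8 Å shell extends the directly contacting residues.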
Table 1. Distribution of the 397 adenylation domains with known specificity among their substrates
| Specificity | Occurrence | Specificity | Occurrence | Specificity | Occurrence |
|---|---|---|---|---|---|
| 3-me-Glu | 1 | Dhb | 15 | Phe | 11 |
| 4pPro | 1 | Dhpg | 8 | Phg | 1 |
| Aad | 10 | Dht | 4 | Pip | 5 |
| Abu | 2 | D-lyserg | 1 | Pro | 16 |
| Aeo | 1 | Gln | 8 | Sal | 2 |
| Ala | 34 | Glu | 12 | Ser | 22 |
| Ala-b | 3 | Gly | 12 | Ser-Thr | 2 |
| Ala-d | 1 | His | 1 | Tcl | 1 |
| Alaninol | 1 | Hpg | 19 | Thr | 24 |
| Arg | 5 | Hyv-d | 1 | Trp | 3 |
| Asn | 14 | Ile | 11 | Tyr | 14 |
| Asp | 12 | Iva | 7 | Val | 27 |
| Bht | 7 | Leu | 31 | Valhyphaa | 1 |
| Bmt | 1 | Lys | 5 | Vol | 1 |
| Cys | 23 | Lys-b | 2 | | |
| Dab | 4 | Orn | 10 | | |
Besides the proteinogenic amino acids in three-letter code, the following rare specificities are known: 3-me-Glu, 3-methyl-glutamate; 4pPro, 4-propyl-proline; Aad, 2-amino-adipic acid; Abu, 2-amino-butyric acid; Aeo, 2-amino-9,10-epoxy-8-oxodecanoic acid; Ala-b, β-alanine; Ala-d, d-alanine; Alaninol; Bht, beta-hydroxy-tyrosine; Bmt, (4R)-4[(E)-2-butenyl]-4-methyl-l-threonine; Dab, 2,4-diamino-butyric acid; Dhb, 2,3-dihydroxy-benzoic acid; Dhpg = Dpg, 3,5-dihydroxy-phenyl-glycine; Dht, dehydro-threonine = Dhbu = 2,3-dehydroaminobutyric acid; D-lyserg, d-lysergic acid; Hpg, 4-hydroxy-phenyl-glycine; Hyv-d, 2-hydroxy-valeric acid; Iva, isovaline; Lys-b, β-lysine; Orn, ornithine; Phg, phenyl-glycine; Pip, pipecolic acid; Sal, salicylic acid; Tcl, (4S)-5,5,5-trichloro-leucine; Valhyphaa, valine or hydrophobic amino acid; Vol, valinol.
Table 2. Clustering of amino acids with similar physico-chemical properties and/or similar substrate binding pockets (6) into composite specificities
| Large clusters | Description | Small clusters | Description |
|---|---|---|---|
| Gly (12), Ala (20), Val (22), Leu (22), Ile (7), Abu (2), Iva (7) | Apolar, aliphatic side chains | Gly (12), Ala (20) | Tiny size, hydrophilic, transition to aliphatic |
| | | Val (22), Leu (22), Ile (7), Abu (2), Iva (7) | Aliphatic, branched hydrophobic side chain |
| Ser (13), Thr (16), Ser/Thr (1), Dhpg (7), Hpg (13) | Aliphatic chain or phenyl group with -OH | Ser (13) | Serine-specific |
| | | Thr (16) | Threonine-specific |
| | | Dhpg (7), Hpg (13) | Polar, uncharged (hydroxy-phenyl) |
| Phe (11), Trp (3), Phg (1), Tyr (12), Bht (6) | Aromatic side chain | Phe (11), Trp (3) | Nonpolar aromatic ring |
| | | Tyr (12), Bht (6) | Polar aromatic ring |
| Asp (8), Asn (13), Glu (9), Gln (6), Aad (7) | Aliphatic chain ending with H-bond donor | Asp (8), Asn (13) | Asp-Asn-hydrogen bond acceptor |
| | | Glu (9), Gln (6) | Glu-Gln-hydrogen bond acceptor |
| | | Aad (7) | 2-Amino-adipic acid |
| Cys (17) | Polar, uncharged (aliphatic chain with -SH group at the end) | – | – |
| Orn (8), Lys (3), Arg (5) | Long positively charged side chain (aliphatic chain with -NH2 group at the end) | Orn (8) | Orn and hydroxy-Orn specific |
| | | Arg (5) | Arg-specific |
| Pro (16), Pip (4) | Cyclic aliphatic chain with polar | Pro (16) | Pro-specific |
| Dhb (9), Sal (2) | Hydroxy-benzoic acid derivatives (no amino group) | No small cluster, no separation possible | – |
The numbers in parentheses denote the counts of domains with unique 8 Å sequence. Please note that the division of large into small clusters was not always possible owing to the small amount of available training data. Also see Figure 3.
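The pooling into composite specificities can be checked with simple arithmetic: summing the per-substrate counts in each large cluster should give the number of positive training points for that cluster. The sketch below does this for the large clusters, using the counts from the clustering table above; the dictionary layout and the names `LARGE_CLUSTERS`, `cluster_sizes` and `total_labeled` are illustrative choices, not part of the original work.

```python
# Counts of domains with unique 8 A sequence, copied from the clustering table.
LARGE_CLUSTERS = {
    "Gly=Ala=Val=Leu=Ile=Abu=Iva": {
        "Gly": 12, "Ala": 20, "Val": 22, "Leu": 22, "Ile": 7, "Abu": 2, "Iva": 7,
    },
    "Ser=Thr=Dhpg=Hpg": {"Ser": 13, "Thr": 16, "Ser/Thr": 1, "Dhpg": 7, "Hpg": 13},
    "Phe=Trp=Phg=Tyr=Bht": {"Phe": 11, "Trp": 3, "Phg": 1, "Tyr": 12, "Bht": 6},
    "Asp=Asn=Glu=Gln=Aad": {"Asp": 8, "Asn": 13, "Glu": 9, "Gln": 6, "Aad": 7},
    "Cys": {"Cys": 17},
    "Orn=Lys=Arg": {"Orn": 8, "Lys": 3, "Arg": 5},
    "Pro=Pip": {"Pro": 16, "Pip": 4},
    "Dhb=Sal": {"Dhb": 9, "Sal": 2},
}

# Positive training points per composite specificity.
cluster_sizes = {name: sum(members.values()) for name, members in LARGE_CLUSTERS.items()}

# Total number of labeled points across all large clusters.
total_labeled = sum(cluster_sizes.values())

for name, size in cluster_sizes.items():
    print(f"{name}: {size}")
print("total labeled:", total_labeled)
```

The per-cluster sums (for example 92 for the aliphatic cluster and 33 for the aromatic cluster) agree with the positive training points listed in the cross-validation results, and the total agrees with the 282 labeled data points reported there for the large clusters.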
Figure 3. Venn diagram grouping amino acids by common physico-chemical properties according to Taylor (49). The colored sets show how similar amino acids have been clustered to composite specificities of A domains. To get larger clusters, several smaller clusters were joined, as indicated by red lines connecting colored sets. This clustering is based on conclusions by Challis et al. (6) on cross-specificities of A domains and on our own groupings according to physico-chemical properties. An asterisk indicates rare non-proteinogenic amino acids; for abbreviations see Table 1.
Table 3. Results of leave-one-out (loo) cross-validation of the different SVMs
| Specificity of SVM | Positive training points | Kernel type | Loo error (%) | Loo Sn (%) | Loo Sp (%) | Loo MCC (%) | Quality of SVM |
|---|---|---|---|---|---|---|---|
| Large clusters: 282 labeled and 664 unlabeled data points (18 + 646) | | | | | | | |
| Dhb=Sal | 11 | l | 0.4 | 100 | 92 | 96 | ++ |
| Asp=Asn=Glu=Gln=Aad | 43 | r | 1.4 | 100 | 91 | 95 | ++ |
| Pro=Pip | 20 | r | 0.7 | 90 | 100 | 95 | ++ |
| Cys | 17 | r | 0.7 | 100 | 89 | 94 | ++ |
| Ser=Thr=Dhpg=Dpg=Hpg | 50 | r | 2.5 | 96 | 91 | 92 | ++ |
| Gly=Ala=Val=Leu=Ile=Abu=Iva | 92 | r | 4.3 | 95 | 93 | 90 | + |
| Orn=Lys=Arg | 16 | l | 0.7 | 88 | 88 | 87 | + |
| Phe=Trp=Phg=Tyr=Bht | 33 | r | 3.2 | 88 | 85 | 85 | 0 |
| Small clusters: 273 labeled and 673 unlabeled data points (27 + 646) | | | | | | | |
| Dhb=Sal | 11 | l | 0 | 100 | 100 | 100 | ++ |
| Aad | 7 | l | 0 | 100 | 100 | 100 | ++ |
| Glu=Gln | 15 | l | 0 | 100 | 100 | 100 | ++ |
| Dhpg=Dpg=Hpg | 20 | l | 0.4 | 100 | 95 | 97 | ++ |
| Ser | 13 | l | 0.4 | 92 | 100 | 96 | ++ |
| Cys | 17 | l | 0.7 | 100 | 89 | 94 | ++ |
| Thr | 16 | l | 0.7 | 94 | 94 | 93 | ++ |
| Pro | 16 | r | 0.7 | 94 | 94 | 93 | ++ |
| Asp=Asn | 21 | l | 1.1 | 90 | 95 | 92 | ++ |
| Val=Leu=Ile=Abu=Iva | 60 | l | 2.9 | 92 | 95 | 91 | + |
| Orn | 8 | l | 0.7 | 88 | 88 | 87 | + |
| Gly=Ala | 32 | l | 3.3 | 81 | 90 | 84 | 0 |
| Tyr | 18 | r | 2.2 | 94 | 77 | 84 | 0 |
| Arg | 5 | l | 0.7 | 80 | 80 | 80 | 0 |
| Phe=Trp | 14 | l | 3.7 | 57 | 67 | 60 | 0 |
The more training data are available, the more reliable the trained predictive models are. The ‘quality of SVM’ in the last column is therefore a qualitative measure for the MCC. Kernel type l stands for linear kernel and r stands for radial basis function kernel. Error rate, sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC) are given in percent.
Figure 4. Results of a comparison of the new SVM-based method with the sequence-based prediction method based on the ‘specificity-conferring code’ by Stachelhaus et al. (2) and Challis et al. (6) (for simplicity we refer to the latter as the ‘Stachelhaus method’). Of the 1230 adenylation domains (extracted automatically with HMMER from the June 2005 version of UniProt), 70% or 858 obtained consistent predictions by both predictors (white sectors). For most of these consistent predictions (54% of the total, or 666) the Stachelhaus method was based on an exact match with a known ‘specificity-conferring code’; the others had at least a 70% match. For 2.4%, or 29 sequences, neither predictor could assign any specificity (no match ≥70%, diagonal hatches). A further 18%, or 217 sequences, could be classified only by the SVMs and not by the Stachelhaus method (light gray sector), and 18 A domains (1.5%) could be classified by the Stachelhaus method but not by the SVMs (cross-hatched); two of them are rare specificities. The Stachelhaus predictions for the rest are mainly based on 70% matches to known specificity ‘codes’. For 108 sequences (8.8%) the predictions were inconsistent, but 38 of them (3% of the total, gray sector) had matches to rare amino acids that were not used for training the SVMs. The remaining 70 incompatible predictions were mainly based on ≤80% identity matches with known ‘specificity-conferring codes’ (black sector).
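The sector sizes in this caption can be cross-checked arithmetically. The short sketch below confirms that the five categories partition the 1230 domains and recomputes their percentages; the counts are taken from the caption, while the dictionary keys and variable names are illustrative labels only.

```python
TOTAL = 1230

# Sector counts as given in the caption.
SECTORS = {
    "consistent": 858,        # both predictors agree
    "svm_only": 217,          # classified only by the SVMs
    "inconsistent": 108,      # predictors disagree
    "no_prediction": 29,      # neither predictor assigns a specificity
    "stachelhaus_only": 18,   # classified only by the Stachelhaus method
}

# The inconsistent sector splits into rare-specificity matches and the rest.
INCONSISTENT_SPLIT = {"rare_not_trained": 38, "remaining": 70}

percent = {name: round(100.0 * count / TOTAL, 1) for name, count in SECTORS.items()}
remaining_percent = round(100.0 * INCONSISTENT_SPLIT["remaining"] / TOTAL, 1)

print(sum(SECTORS.values()) == TOTAL)
print(percent)
print(remaining_percent)
```

The remaining 70 inconsistent predictions correspond to about 5.7% of all domains, which is consistent with the "<6%" disagreement quoted in the abstract.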