Marc Röttig, Christian Rausch, Oliver Kohlbacher.
Abstract
An important aspect of the functional annotation of enzymes is not only the type of reaction catalysed by an enzyme, but also the substrate specificity, which can vary widely within the same family. In many cases, prediction of family membership and even substrate specificity is possible from enzyme sequence alone, using a nearest neighbour classification rule. However, the combination of structural information and sequence information can improve the interpretability and accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) can then learn a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM can then predict the most likely substrate. The models can also be analysed to reveal the underlying structural reasons determining substrate specificities and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets and the structural insights gained from ASC by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/.
Year: 2010 PMID: 20072606 PMCID: PMC2796266 DOI: 10.1371/journal.pcbi.1000636
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Graphical overview of the ASC method.
(A) In the first step, training sequences are aligned using 3DCoffee to obtain a multiple sequence alignment (MSA). (B) In the second step, residues lining the active site are extracted from the template structure. (C) In the third step, the extracted residues are mapped along the MSA to obtain an active-site signature for each sequence. (D) These signatures are then encoded into feature vectors using the three descriptors. Alternatively, kernels may be used. (E) The final ASC model is trained using the generated feature vectors.
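Steps (B) to (D) can be illustrated with a minimal sketch. It assumes the MSA is available as equal-length gapped strings and that the active-site residues are given as 0-based indices into the ungapped template sequence. The function names, the toy sequences and the one-hot encoding are illustrative assumptions: the descriptors actually used by ASC are not named in this text, so one-hot encoding stands in for them here.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def template_columns(aligned_template: str, residue_indices: List[int]) -> List[int]:
    """Map 0-based residue indices of the ungapped template to MSA columns."""
    column_of = {}
    residue_pos = -1
    for col, symbol in enumerate(aligned_template):
        if symbol != "-":          # gaps do not consume a template residue
            residue_pos += 1
            column_of[residue_pos] = col
    return [column_of[i] for i in residue_indices]


def signature(aligned_seq: str, columns: List[int]) -> str:
    """Read the symbols of one aligned sequence at the active-site columns."""
    return "".join(aligned_seq[c] for c in columns)


def one_hot(sig: str) -> List[int]:
    """Encode a signature as 20 binary features per position (assumed encoding).

    Gaps and non-standard symbols become an all-zero block.
    """
    vec: List[int] = []
    for symbol in sig:
        vec.extend(1 if symbol == aa else 0 for aa in AMINO_ACIDS)
    return vec


# Toy alignment (hypothetical sequences, not taken from the benchmark data).
template = "MK-VLAR"
family = {"seq_a": "MRGV-AK", "seq_b": "MKAVLGR"}

cols = template_columns(template, [1, 3, 5])   # hypothetical active-site residues
signatures = {name: signature(seq, cols) for name, seq in family.items()}
features = {name: one_hot(sig) for name, sig in signatures.items()}
```

The resulting feature vectors could then be passed to any SVM implementation for step (E), together with the known substrate labels of the training sequences.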
ASC performance on Hannenhalli benchmark.
| Family | 1NN | ASC | K | WDL | ASC | MSA | K | WDL | MTTSI | N |
| Cyclase | 0.66 |  | B | W |  | 0.38 | B | W | 0.2 | 137 |
| Kinase |  |  | S | D |  |  | S | D | 0.4 | 294 |
| Dehydrogenase | 0.95 |  | B | W |  |  | B | D | 0.4 | 376 |
| Trypsin | 0.81 |  | W | W |  | 0.80 | W | W | 0.5 | 78 |
F-measures of the nearest neighbour classifier, the ASC classifier and the best classifier based on the full MSA are given in the columns 1NN, ASC and MSA, respectively. The first part of the table compares ASC with the 1NN classifier; its WDL column gives the wins, draws and losses of the pairwise comparisons. The best-performing kernels are given in the columns labelled K. Similarly, the second part of the table compares ASC with the SVM classifier based on the full MSA. The last two columns give the MTTSI value used and the number of available sequences (N).
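For readers unfamiliar with the measure reported in these tables, the F-measure is the harmonic mean of precision and recall. The sketch below computes it for one class and labels a pairwise comparison as win, draw or loss. Two points are assumptions rather than the authors' procedure: how F-measures are aggregated over classes or cross-validation runs, and how a draw is decided (a simple tolerance is used here as a placeholder).

```python
from typing import Sequence


def f_measure(true_labels: Sequence[str], predicted: Sequence[str], positive: str) -> float:
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0                      # avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def win_draw_loss(score_a: float, score_b: float, tolerance: float = 0.005) -> str:
    """Label a comparison from the point of view of classifier A.

    The tolerance-based draw is an assumption made for illustration only.
    """
    if abs(score_a - score_b) <= tolerance:
        return "D"
    return "W" if score_a > score_b else "L"


# Toy labels (hypothetical, not benchmark data).
truth = ["A", "A", "A", "B", "B"]
pred = ["A", "A", "B", "B", "A"]
score = f_measure(truth, pred, positive="A")   # precision = recall = 2/3
```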
ASC performance on Capra benchmark.
| EC pair | 1NN | ASC | WDL | ASC | MSA | WDL | MTTSI |
| 1.1.1.100/1.1.1.62 | 0.72 |  | W | 0.76 |  | L | 0.40 |
| 1.1.1.103/1.1.1.284 |  |  | D |  |  | D | 0.40 |
| 1.1.1.1/1.1.1.103 | 0.96 |  | W |  |  | D | 0.40 |
| 1.1.1.1/1.1.1.284 | 0.56 |  | W |  | 0.74 | W | 0.60 |
| 1.1.1.41/1.1.1.42 |  | 0.72 | L | 0.72 |  | L | 0.50 |
| 1.1.1.41/1.1.1.85 |  |  | D |  | 0.96 | W | 0.60 |
| 1.1.1.42/1.1.1.85 | 0.98 |  | W |  | 0.98 | W | 0.60 |
| 1.2.1.3/1.2.1.71 |  |  | D |  |  | D | 0.50 |
| 1.2.1.3/1.2.1.8 | 0.92 |  | W |  | 0.83 | W | 0.60 |
| 1.4.1.3/1.4.1.4 |  |  | D |  |  | D | 0.60 |
| 1.8.1.4/1.8.1.7 |  |  | D |  |  | D | 0.60 |
| 2.1.2.2/2.1.2.9 | 0.88 |  | W |  |  | D | 0.30 |
| 2.1.3.2/2.1.3.3 | 0.77 |  | W |  | 0.91 | W | 0.30 |
| 2.2.1.1/2.2.1.7 | 0.99 |  | W |  | 0.98 | W | 0.50 |
| 2.3.1.16/2.3.1.9 | 0.81 |  | W | 0.93 |  | L | 0.50 |
| 2.4.2.10/2.4.2.7 | 0.84 |  | W |  | 0.90 | W | 0.30 |
| 2.4.2.22/2.4.2.8 |  |  | D |  |  | D | 0.40 |
| 2.4.2.8/2.4.2.9 |  |  | D |  | 0.91 | W | 0.40 |
| 2.5.1.10/2.5.1.29 | 0.42 |  | W |  | 0.34 | W | 0.40 |
| 2.5.1.1/2.5.1.10 | 0.52 |  | W | 0.58 |  | L | 0.50 |
| 2.5.1.1/2.5.1.29 | 0.53 |  | W |  | 0.73 | W | 0.50 |
| 2.6.1.11/2.6.1.62 |  |  | D |  | 0.88 | W | 0.40 |
| 2.6.1.11/2.6.1.13 |  |  | D |  | 0.83 | W | 0.50 |
| 2.6.1.11/2.6.1.76 |  |  | D |  | 0.46 | W | 0.40 |
| 2.6.1.13/2.6.1.62 |  |  | D |  |  | D | 0.40 |
| 2.6.1.13/2.6.1.76 |  |  | D |  |  | D | 0.40 |
| 2.6.1.1/2.6.1.9 |  |  | D |  |  | D | 0.40 |
| 2.6.1.62/2.6.1.76 |  |  | D |  |  | D | 0.40 |
| 2.7.2.11/2.7.2.8 |  |  | D |  |  | D | 0.40 |
| 2.7.3.2/2.7.3.3 | 0.96 |  | W |  | 0.96 | W | 0.70 |
| 3.1.1.1/3.1.1.7 | 0.68 |  | W |  | 0.85 | W | 0.40 |
| 3.5.3.1/3.5.3.8 |  |  | D |  | 0.94 | W | 0.40 |
| 3.6.3.6/3.6.3.8 |  | 0.76 | L | 0.76 |  | L | 0.40 |
| 4.1.1.17/4.1.1.20 |  |  | D |  |  | D | 0.40 |
| 4.2.1.3/4.2.1.33 | 0.98 |  | W |  | 0.37 | W | 0.40 |
| 4.2.1.3/4.2.1.36 |  |  | D |  | 0.70 | W | 0.40 |
| 4.3.1.3/4.3.1.5 |  |  | D |  | 0.93 | W | 0.40 |
| 4.3.2.1/4.3.2.2 |  | 0.97 | L | 0.97 |  | L | 0.40 |
| 5.1.3.2/5.1.3.20 |  |  | D |  |  | D | 0.50 |
| 5.4.2.10/5.4.2.2 |  | 0.62 | L | 0.62 |  | L | 0.40 |
| 6.1.1.11/6.1.1.15 |  | 0.97 | L | 0.97 |  | L | 0.50 |
| 6.1.1.12/6.1.1.22 | 0.94 |  | W |  | 0.91 | W | 0.40 |
| 6.1.1.12/6.1.1.6 |  |  | D |  |  | D | 0.40 |
| 6.1.1.15/6.1.1.3 | 0.99 |  | W |  | 0.89 | W | 0.40 |
| 6.1.1.17/6.1.1.18 | 0.98 |  | W |  | 0.98 | W | 0.60 |
| 6.3.2.13/6.3.2.8 |  |  | D |  |  | D | 0.40 |
| 6.3.2.13/6.3.2.9 | 0.83 |  | W |  | 0.93 | W | 0.30 |
| 6.3.2.8/6.3.2.9 | 0.85 |  | W |  | 0.92 | W | 0.30 |
F-measures of the nearest neighbour classifier, the ASC classifier and the best classifier based on the full MSA are given in the columns 1NN, ASC and MSA, respectively. The first part of the table compares ASC with the 1NN classifier; its WDL column gives the wins, draws and losses of the pairwise comparisons. Similarly, the second part of the table compares ASC with the SVM classifier based on the full MSA. The last column gives the MTTSI value used.
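The 1NN baseline in these tables assigns a query sequence the label of its most similar training sequence. A minimal sketch follows, assuming that similarity is the fraction of identical residues over aligned, non-gap columns. The similarity measure actually used for the baseline is not stated in this text, and the sequences below are hypothetical; only the substrate names are taken from the article.

```python
from typing import List, Tuple


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def predict_1nn(query: str, training: List[Tuple[str, str]]) -> str:
    """Return the label of the training sequence most similar to the query."""
    _, best_label = max(training, key=lambda item: sequence_identity(query, item[0]))
    return best_label


# Hypothetical aligned sequences labelled with substrates named in the article.
training_set = [
    ("MKVLAR", "isocitrate"),
    ("MRVIAK", "isopropylmalate"),
]
prediction = predict_1nn("MKVLGR", training_set)
```

With these toy sequences the query shares five of six positions with the first training sequence and two of six with the second, so the first label is returned.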
Decarboxylating dehydrogenases active site signature.
| Number | Amino acid | CORE score | Number | Amino acid | CORE score |
| 1 | Val73 | 8 | 10 | Lys190* | 9 |
| 2 | Glu88 | 6 | 11 | Asn192* | 9 |
| 3 | Leu91 | 7 | 12 | Val193* | 7 |
| 4 | Leu92 | 8 | 13 | Asp222* | 9 |
| 5 | Arg95 | 9 | 14 | Asn242 | 9 |
| 6 | Arg105 | 8 | 15 | Asp246 | 8 |
| 7 | Arg133 | 9 | 16 | Asp250 | 9 |
| 8 | Leu135 | 7 | 17 | Glu275 | 9 |
| 9 | Tyr140 | 9 |  |  |  |
Residue identifiers are taken from the template crystal structure (PDB-Id: 1A05). Residues highlighted with asterisks are from chain B of the homo-dimeric enzyme. The CORE scores are those from the family MSA generated by 3DCoffee.
Figure 2. View of the superimposed active sites of IPMDH and ICDH.
The first chain of the homo-dimeric enzyme is represented by its solvent-excluded surface. The second chain is depicted in a backbone representation. The two substrates isocitrate (purple) and isopropylmalate (green) lie in the interface of the two chains. IPMDH sidechains are coloured green and sidechains from ICDH (PDB-Id: 1AI2, [40]) are coloured purple. This figure was created using BALLView [41].