Nelson Kibinge, Shun Ikeda, Naoaki Ono, Md Altaf-Ul-Amin, Shigehiko Kanaya.
Abstract
Progress in the "omics" fields such as genomics, transcriptomics, proteomics, and metabolomics has engendered a need for innovative analytical techniques to derive meaningful information from the ever-increasing molecular data. KNApSAcK motorcycle DB is a popular database for enzymes related to secondary metabolic pathways in plants. One of the challenges in analyzing protein sequence data in such repositories is the standard notation of sequences as strings of alphabetical characters. This notation lacks a natural underlying metric, which makes the sequences less amenable to computation. To address this, we applied a novel integration of selected biochemical and physical attributes of amino acids, derived from the amino acid index and quantified on a numerical scale, to examine the diversity of peptide sequences of terpenoid synthases accumulated in KNApSAcK motorcycle DB. We initially generated a reduced amino acid index table. This is a set of biochemical and physical properties obtained by random forest feature selection of important indices from the amino acid index. Principal component analysis was then applied to characterize the enzymes involved in the synthesis of terpenoids. Incorporating residue attributes into the analysis increased the variance explained.
Year: 2014 PMID: 24900985 PMCID: PMC4034505 DOI: 10.1155/2014/753428
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: Random forest algorithm. Mechanism of the random forest (RF) algorithm, from data selection by bootstrapping up to variable importance calculation. The amino acid index data, which contain physicochemical metrics of amino acids, were subjected to RF for index selection.
Figure 2: KNApSAcK motorcycle database. Enzyme-reaction database. (a) The main window of motorcycle. (b) An example of a keyword search. (c) An example of a BLASTP search.
Figure 3: Variable importance scores (1000 RF trials). Distribution of variable importance scores for 1000 runs of random forest classification on the amino acid index. Each boxplot represents the score distribution of one property (also called a variable) on the horizontal axis. The properties are sorted in decreasing order of the median score (red line in each boxplot). For easier visualization, the set has been truncated to show the top 50 properties. The corresponding properties are shown in Table 1.
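The ordering described in this caption (sort by median score over repeated runs, then truncate) can be sketched as below. This is a minimal illustration and not the authors' code: the function name `rank_by_median`, the placeholder property names, and the toy scores are all assumptions made for the example.

```python
from statistics import median

def rank_by_median(scores, top_n=50):
    """Order properties by decreasing median importance and keep the top_n.

    scores maps a property name to its importance scores across repeated
    random forest runs (1000 runs in the figure; 3 in the toy data below).
    """
    ranked = sorted(scores, key=lambda prop: median(scores[prop]), reverse=True)
    return ranked[:top_n]

# Toy importance scores, invented for illustration only (not from the paper).
toy_scores = {
    "PROP_A": [0.30, 0.50, 0.40],
    "PROP_B": [0.90, 0.70, 0.80],
    "PROP_C": [0.10, 0.20, 0.60],
}
top_two = rank_by_median(toy_scores, top_n=2)
print(top_two)
```

Using the median rather than the mean keeps a single unusually high or low run from changing the ranking.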
Table 1: Top 50 ranked indices. Descriptions of the top 50 indices ranked by variable importance (VIM) score, prior to nested RF. The ID column gives the AAindex accession ID and the Property column gives the corresponding biochemical and physical property (BPP). Please refer to the amino acid index database in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/753428 for the references cited in the Property column.
| Rank | ID | Property |
|---|---|---|
| 1 | RADA880101 | Transfer-free energy from chx to wat (Radzicka-Wolfenden, 1988) |
| 2 | ROSM880101 | Side chain hydropathy, uncorrected for solvation (Roseman, 1988) |
| 3 | KIDA850101 | Hydrophobicity-related index (Kidera et al., 1985) |
| 4 | EISD840101 | Consensus normalized hydrophobicity scale (Eisenberg, 1984) |
| 5 | JACR890101 | Weights from the IFH scale (Jacobs-White, 1989) |
| 6 | COWR900101 | Hydrophobicity index, 3.0 pH (Cowan-Whittaker, 1990) |
| 7 | BLAS910101 | Scaled side chain hydrophobicity values (Black-Mould, 1991) |
| 8 | MEEJ810101 | Retention coefficient in NaClO4 (Meek-Rossetti, 1981) |
| 9 | CIDH920104 | Normalized hydrophobicity scales for alpha/beta-proteins (Cid et al., 1992) |
| 10 | GRAR740102 | Polarity (Grantham, 1974) |
| 11 | ZIMJ680103 | Polarity (Zimmerman et al., 1968) |
| 12 | MEEJ810102 | Retention coefficient in NaH2PO4 (Meek-Rossetti, 1981) |
| 13 | RADA880104 | Transfer-free energy from chx to oct (Radzicka-Wolfenden, 1988) |
| 14 | KUHL950101 | Hydrophilicity scale (Kuhn et al., 1995) |
| 15 | FAUJ880110 | Number of full nonbonding orbitals (Fauchere et al., 1988) |
| 16 | RADA880107 | Energy transfer from out to in (95% buried) (Radzicka-Wolfenden, 1988) |
| 17 | WARP780101 | Average interactions per side chain atom (Warme-Morgan, 1978) |
| 18 | BIOV880102 | Information value for accessibility; average fraction 23% (Biou et al., 1988) |
| 19 | BIOV880101 | Information value for accessibility; average fraction 35% (Biou et al., 1988) |
| 20 | FASG890101 | Hydrophobicity index (Fasman, 1989) |
| 21 | PONP800108 | Average number of surrounding residues (Ponnuswamy et al., 1980) |
| 22 | FAUJ830101 | Hydrophobic parameter pi (Fauchere-Pliska, 1983) |
| 23 | CORJ870101 | NNEIG index (Cornette et al., 1987) |
| 24 | MANP780101 | Average surrounding hydrophobicity (Manavalan-Ponnuswamy, 1978) |
| 25 | LIFS790102 | Conformational preference for parallel beta-strands (Lifson-Sander, 1979) |
| 26 | GUOD860101 | Retention coefficient at pH 2 (Guo et al., 1986) |
| 27 | WOLS870101 | Principal property value z1 (Wold et al., 1987) |
| 28 | WOLR790101 | Hydrophobicity index (Wolfenden et al., 1979) |
| 29 | LEVM760101 | Hydrophobic parameter (Levitt, 1976) |
| 30 | WOLR810101 | Hydration potential (Wolfenden et al., 1981) |
| 31 | ROSM880102 | Side chain hydropathy, corrected for solvation (Roseman, 1988) |
| 32 | WOEC730101 | Polar requirement (Woese, 1973) |
| 33 | PLIV810101 | Partition coefficient (Pliska et al., 1981) |
| 34 | RADA880108 | Mean polarity (Radzicka-Wolfenden, 1988) |
| 35 | FAUJ880109 | Number of hydrogen bond donors (Fauchere et al., 1988) |
| 36 | HOPA770101 | Hydration number (Hopfinger, 1971), Cited by Charton-Charton (1982) |
| 37 | CIDH920103 | Normalized hydrophobicity scales for alpha + beta-proteins (Cid et al., 1992) |
| 38 | TANS770106 | Normalized frequency of chain reversal D (Tanaka-Scheraga, 1977) |
| 39 | PRAM900101 | Hydrophobicity (Prabhakaran, 1990) |
| 40 | ENGD860101 | Hydrophobicity index (Engelman et al., 1986) |
| 41 | BROC820101 | Retention coefficient in TFA (Browne et al., 1982) |
| 42 | NADH010103 | Hydropathy scale based on self-information values in the two-state model (16% accessibility) (Naderi-Manesh et al., 2001) |
| 43 | NADH010102 | Hydropathy scale based on self-information values in the two-state model (9% accessibility) (Naderi-Manesh et al., 2001) |
| 44 | KYTJ820101 | Hydropathy index (Kyte-Doolittle, 1982) |
| 45 | EISD860103 | Direction of hydrophobic moment (Eisenberg-McLachlan, 1986) |
| 46 | NISK800101 | 8 A contact number (Nishikawa-Ooi, 1980) |
| 47 | JURD980101 | Modified Kyte-Doolittle hydrophobicity scale (Juretic et al., 1998) |
| 48 | WIMW960101 | Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996) |
| 49 | QIAN880122 | Weights for beta-sheet at the window position of 2 (Qian-Sejnowski, 1988) |
| 50 | PUNT030101 | Knowledge-based membrane-propensity scale from 1D_Helix in MPtopo databases (Punta-Maritan, 2003) |
Figure 4: Variation of the BPP importance. Standard deviation of the importance scores of each property (y-axis), which models the contribution of each property to the performance of the RF algorithm. Variables whose variation is close to zero are less “important.” At the tail end, the variance is greater than zero by chance.
Figure 5: Nested random forest error rates. Nested random forest variable selection: variables are ordered by their importance scores, new RF models are built by adding a single variable at a time in the nested RF setup, and RF error rates are measured. In this experiment, each box and whisker plot represents the distribution of error rates from 100 trials at each nested RF step. In total, there were 93 steps, corresponding to the 93 top indices from the previous step. The y-axis shows the error rates as a percentage. The threshold of acceptable mean error rate was set at 2 percent, shown by the red horizontal line.
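The nested procedure in this caption can be sketched as a loop over growing prefixes of the ranked variable list. The sketch rests on stated assumptions: `nested_selection` and `toy_error` are hypothetical names, `error_fn` stands in for training and evaluating one random forest, and the stopping rule (take the smallest prefix whose mean error is at or below the threshold) is one plausible reading of the caption rather than the documented selection rule.

```python
from statistics import mean

def nested_selection(ranked_vars, error_fn, trials=100, threshold=2.0):
    """Grow the variable set one variable at a time, in ranked order.

    error_fn(subset) is a hypothetical callable that trains one model on
    `subset` and returns its error rate in percent. Returns the smallest
    prefix whose mean error over `trials` is at or below `threshold`
    (or None if no step qualifies), plus the mean error at every step.
    """
    mean_errors = []
    selected = None
    for step in range(1, len(ranked_vars) + 1):
        subset = ranked_vars[:step]
        step_mean = mean([error_fn(subset) for _ in range(trials)])
        mean_errors.append(step_mean)
        if selected is None and step_mean <= threshold:
            selected = subset
    return selected, mean_errors

# Stand-in error function (an assumption): error halves with each added variable.
def toy_error(subset):
    return 40.0 / (2 ** len(subset))

chosen, errors = nested_selection(
    ["v1", "v2", "v3", "v4", "v5", "v6"], toy_error, trials=3
)
print(chosen)
print(errors)
```

Keeping the mean error of every step, not only the selected prefix, makes it possible to draw the per-step error curve that the figure shows.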
Table 2: Reduced amino acid index (rAAindex). The first column contains the names of the 20 amino acids that make up protein sequences. Each of the 8 subsequent columns is a BPP attribute selected for its importance in explaining variation among the amino acids. Each row is a vector describing an amino acid in 8 dimensions, each of which represents a physical property tabulated in Table 3.
| Amino acid | JACR890101 | COWR900101 | ZIMJ680103 | MEEJ810102 | FAUJ880110 | WARP780101 | PONP800108 | LIFS790102 |
|---|---|---|---|---|---|---|---|---|
| Ala | 0.18 | 0.42 | 0.00 | 1.00 | 0.00 | 10.04 | 6.05 | 1.00 |
| Arg | −5.40 | −1.56 | 52.00 | −2.00 | 3.00 | 6.18 | 5.70 | 0.68 |
| Asn | −1.30 | −1.03 | 3.38 | −3.00 | 3.00 | 5.63 | 5.04 | 0.54 |
| Asp | −2.36 | −0.51 | 49.70 | −0.50 | 4.00 | 5.76 | 4.95 | 0.50 |
| Cys | 0.27 | 0.84 | 1.48 | 4.60 | 0.00 | 8.89 | 7.86 | 0.91 |
| Gln | −1.22 | −0.96 | 3.53 | −2.00 | 3.00 | 5.41 | 5.45 | 0.28 |
| Glu | −2.10 | −0.37 | 49.90 | 1.10 | 4.00 | 5.37 | 5.10 | 0.59 |
| Gly | 0.09 | 0.00 | 0.00 | 0.20 | 0.00 | 7.99 | 6.16 | 0.79 |
| His | −1.48 | −2.28 | 51.60 | −2.20 | 1.00 | 7.49 | 5.80 | 0.38 |
| Ile | 0.37 | 1.81 | 0.13 | 7.00 | 0.00 | 8.72 | 7.51 | 2.60 |
| Leu | 0.41 | 1.80 | 0.13 | 9.60 | 0.00 | 8.79 | 7.37 | 1.42 |
| Lys | −2.53 | −2.03 | 49.50 | −3.00 | 1.00 | 4.40 | 4.88 | 0.59 |
| Met | 0.44 | 1.18 | 1.43 | 4.00 | 0.00 | 9.15 | 6.39 | 1.49 |
| Phe | 0.50 | 1.74 | 0.35 | 12.60 | 0.00 | 7.98 | 6.62 | 1.30 |
| Pro | −0.20 | 0.86 | 1.58 | 3.10 | 0.00 | 7.79 | 5.65 | 0.35 |
| Ser | −0.40 | −0.64 | 1.67 | −2.90 | 2.00 | 7.08 | 5.53 | 0.70 |
| Thr | −0.34 | −0.26 | 1.66 | −0.60 | 2.00 | 7.00 | 5.81 | 0.59 |
| Trp | −0.01 | 1.46 | 2.10 | 15.10 | 0.00 | 8.07 | 6.98 | 0.89 |
| Tyr | −0.08 | 0.51 | 1.61 | 6.70 | 2.00 | 6.90 | 6.73 | 1.08 |
| Val | 0.32 | 1.34 | 0.13 | 4.60 | 0.00 | 8.88 | 7.62 | 2.63 |
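One way to use the table above is to replace each residue letter with its 8-value row. The sketch below copies the values from the table; the one-letter keys, the function name `encode`, and the choice to concatenate rows into a single flat vector are assumptions made for illustration, and the handling of sequences of different lengths is not addressed.

```python
# Columns of the reduced amino acid index, in the order of the table above.
PROPERTIES = ["JACR890101", "COWR900101", "ZIMJ680103", "MEEJ810102",
              "FAUJ880110", "WARP780101", "PONP800108", "LIFS790102"]

# Values copied from the table; keys are standard one-letter amino acid codes.
RAAINDEX = {
    "A": [0.18, 0.42, 0.00, 1.00, 0.00, 10.04, 6.05, 1.00],
    "R": [-5.40, -1.56, 52.00, -2.00, 3.00, 6.18, 5.70, 0.68],
    "N": [-1.30, -1.03, 3.38, -3.00, 3.00, 5.63, 5.04, 0.54],
    "D": [-2.36, -0.51, 49.70, -0.50, 4.00, 5.76, 4.95, 0.50],
    "C": [0.27, 0.84, 1.48, 4.60, 0.00, 8.89, 7.86, 0.91],
    "Q": [-1.22, -0.96, 3.53, -2.00, 3.00, 5.41, 5.45, 0.28],
    "E": [-2.10, -0.37, 49.90, 1.10, 4.00, 5.37, 5.10, 0.59],
    "G": [0.09, 0.00, 0.00, 0.20, 0.00, 7.99, 6.16, 0.79],
    "H": [-1.48, -2.28, 51.60, -2.20, 1.00, 7.49, 5.80, 0.38],
    "I": [0.37, 1.81, 0.13, 7.00, 0.00, 8.72, 7.51, 2.60],
    "L": [0.41, 1.80, 0.13, 9.60, 0.00, 8.79, 7.37, 1.42],
    "K": [-2.53, -2.03, 49.50, -3.00, 1.00, 4.40, 4.88, 0.59],
    "M": [0.44, 1.18, 1.43, 4.00, 0.00, 9.15, 6.39, 1.49],
    "F": [0.50, 1.74, 0.35, 12.60, 0.00, 7.98, 6.62, 1.30],
    "P": [-0.20, 0.86, 1.58, 3.10, 0.00, 7.79, 5.65, 0.35],
    "S": [-0.40, -0.64, 1.67, -2.90, 2.00, 7.08, 5.53, 0.70],
    "T": [-0.34, -0.26, 1.66, -0.60, 2.00, 7.00, 5.81, 0.59],
    "W": [-0.01, 1.46, 2.10, 15.10, 0.00, 8.07, 6.98, 0.89],
    "Y": [-0.08, 0.51, 1.61, 6.70, 2.00, 6.90, 6.73, 1.08],
    "V": [0.32, 1.34, 0.13, 4.60, 0.00, 8.88, 7.62, 2.63],
}

def encode(sequence):
    """Concatenate the 8 property values of each residue into one flat vector."""
    vector = []
    for residue in sequence.upper():
        vector.extend(RAAINDEX[residue])  # raises KeyError for non-standard residues
    return vector

vec = encode("GAV")
print(len(vec))  # 3 residues x 8 properties
print(vec[:8])   # the row for Gly
```

Unlike a letter string, the resulting vectors support distances and averages directly, which is the property the abstract refers to as a natural underlying metric.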
Table 3: Annotation of the selected properties. Descriptions of the 8 indices selected after the nested random forest variable selection. The ID column gives the AAindex accession ID and the Property column gives the corresponding BPP. Please refer to the amino acid index database for the references cited in the Property column.
| No. | ID | Property |
|---|---|---|
| 1 | JACR890101 | Weights from the IFH scale (Jacobs-White, 1989) |
| 2 | COWR900101 | Hydrophobicity index, 3.0 pH (Cowan-Whittaker, 1990) |
| 3 | ZIMJ680103 | Polarity (Zimmerman et al., 1968) |
| 4 | MEEJ810102 | Retention coefficient in NaH2PO4 (Meek-Rossetti, 1981) |
| 5 | WARP780101 | Average interactions per side chain atom (Warme-Morgan, 1978) |
| 6 | FAUJ880110 | Number of full nonbonding orbitals (Fauchere et al., 1988) |
| 7 | PONP800108 | Average number of surrounding residues (Ponnuswamy et al., 1980) |
| 8 | LIFS790102 | Conformational preference for parallel beta-strands (Lifson-Sander, 1979) |
Figure 6: Principal component analysis of terpenoid synthases. Left: principal component analysis of the terpenoid synthase subfamilies, with amino acid residues encoded using an 8-bit binary method. PC1 and PC2 together explain 14.84 percent of the variance. Right: principal component analysis of the same data set encoded using the biochemical and physical properties of amino acid residues. PC1 and PC2 in this case explain 30.02 percent of the variance. The four subfamilies clustered are monoterpenoid, diterpenoid, triterpenoid, and sesquiterpenoid synthases. Triterpenoid synthases and subgroups of terpenoid synthases are distinctly different in structure compared to the other synthases.
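The percentages quoted in this caption are the share of total variance carried by the first two principal components. The sketch below shows how such a figure is computed, under stated assumptions: `explained_variance_top2` is a hypothetical name, the toy rows are invented, and plain power iteration with deflation is used here only to keep the example dependency-free; it is not claimed to be the method used in the paper.

```python
def explained_variance_top2(rows, iters=500):
    """Fraction of total variance captured by the first two principal components.

    rows: list of equal-length numeric vectors (one encoded sequence per row).
    Uses power iteration with deflation on the sample covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))  # trace = total variance

    top = []
    for _ in range(2):
        v = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-zero start vector
        lam = 0.0
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
            lam = norm  # converges to the largest remaining eigenvalue
        top.append(lam)
        # Deflate: subtract the component just found before finding the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return sum(top) / total

# Toy data, invented for illustration: the third column is constant, so the
# first two components carry essentially all of the variance.
toy_rows = [[0, 0, 5], [1, 0, 5], [0, 2, 5], [1, 2, 5]]
ratio = explained_variance_top2(toy_rows)
print(round(ratio, 6))
```

Running the same function on two encodings of the same sequences (for example a binary encoding and a property-based one) gives directly comparable ratios. For real sequence data a numerical library would be far faster than this pure-Python loop.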