Alvaro Olivera-Nappa, Barbara A Andrews, Juan A Asenjo.
Abstract
BACKGROUND: Functionally relevant artificial or natural mutations are difficult to assess or predict when no structure-function information is available for a protein. This limitation matters most when identifying functionally significant non-synonymous single nucleotide polymorphisms (nsSNPs) or when designing a site-directed mutagenesis strategy for a target protein. A new methodology is proposed to guide both decisions. It relies only on conservation rules for the physicochemical properties of amino acids, extracted from a multiple alignment of the protein family to which the target protein belongs, and does not require explicit structure-function relationships.
Year: 2011 PMID: 21524307 PMCID: PMC3123232 DOI: 10.1186/1471-2105-12-122
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow diagram of the whole MOSST algorithm. The algorithm can be used either to develop rational protein design strategies or to identify functionally significant nsSNPs.
Table 1. Nomenclature and abbreviations
| Variable name | Description |
|---|---|
| | Posterior cumulative distribution function for the values of the property |
| | Cumulative distribution function of the absolute difference between the variances of any two properties |
| | Cumulative distribution function for a random combination of |
| | Null hypothesis in the hypothesis test for differences |
| | Alternative hypothesis in the hypothesis test for differences |
| | Position number in the multiple alignment |
| | Invariable determinant position |
| | Amino acid numbering sub-index |
| | Total number of physicochemical properties considered in the MOSST analysis |
| | Number of possible pairwise combinations (subsets having two elements) of properties |
| | Number of amino acids in a multiple alignment position (without taking gaps into account) |
| | Total number of positions in the multiple alignment |
| | Normal distribution function with unknown parameters |
| | Number of amino acids in position |
| nsSNP | Non-synonymous single nucleotide polymorphism |
| | Value of the |
| | Probability of getting a certain value or less for the absolute difference |
| | Posterior probability distribution function for the values of the property |
| | Global probability of the null hypothesis being true considering all the properties together at position |
| | Probability of getting a certain value or less for the absolute difference |
| | Probability of getting a certain value or less for the absolute difference |
| | Cumulative probability of getting a certain value or less for the variance ( |
| | Discrete probability mass function for a random combination of |
| | Cumulative probability of getting a certain value or less for the variance for a given property |
| | Global probability for the occurrence of the amino acid |
| | Probability of the amino acid |
| | Property numbering sub-index |
| | Variable determinant position |
| | Variable irrelevant position |
| | Unknown quantity or variable |
| | Measure, value or score of the physicochemical property |
| | Measure, value or score of the physicochemical property |
| | Absolute difference between the variances of any two properties |
| | Absolute difference between the variances of any two properties |
| | Absolute difference between the variances of any two properties |
| | Mean (of a normal distribution) |
| | Arithmetical average of the physicochemical property |
| | Arithmetical average of the physicochemical property |
| | Variance (of a normal distribution) |
| | Sample variance estimator (standard deviation) of the physicochemical property |
| | Sample variance estimator (standard deviation) of the physicochemical property |
| | Test statistic for the ( |
| | Generic physicochemical property |
| | Physicochemical properties 1, 2 and 3 |
| | Physicochemical properties |
Figure 2. Example multiple alignment of seven protein amino acid sequences. The example proteins have 16, 15, 15, 14, 15, 12 and 13 amino acids, respectively. The multiple alignment has a length of 18 positions, which means that every sequence has at least 2 gaps. The calculation of the mean and variance of the property Ω is shown in detail for position 9 of the alignment, which contains 6 amino acids (1 histidine, 3 threonines and 2 serines) and one gap.
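The per-position calculation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Kyte-Doolittle hydropathy values stand in for the generic property Ω, the order of residues in the column is arbitrary, and the use of the unbiased (n - 1) sample variance is an assumption. The helper name `column_stats` is hypothetical.

```python
from statistics import mean, variance

# Kyte-Doolittle hydropathy values for the three residues in the example column.
# Assumption: this scale merely stands in for the generic property Ω.
HYDROPATHY = {"H": -3.2, "T": -0.7, "S": -0.8}


def column_stats(column, scale, gap="-"):
    """Return (n, mean, sample variance) of a property over one alignment column."""
    values = [scale[aa] for aa in column if aa != gap]  # gaps are not counted
    if len(values) < 2:
        raise ValueError("need at least two residues to estimate a variance")
    # statistics.variance divides by n - 1 (assumed estimator).
    return len(values), mean(values), variance(values)


# One histidine, three threonines, two serines and one gap (order is arbitrary).
n, avg, var = column_stats("HTTTSS-", HYDROPATHY)
print(n, round(avg, 3), round(var, 3))
```

Skipping gaps before counting keeps the number of residues consistent with the caption, where the seven-sequence column contributes only six values.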
Figure 3. Typical plots of the cumulative distribution functions (CDF) of sample variances. These plots relate a calculated sample variance, for any combination of amino acids, to the associated probability of obtaining such a sample variance for the property Ω by choosing amino acids at random. The CDF profiles vary depending on the number of amino acids selected. Continuing with the example of Figure 2, the sample variance has an associated probability that can be found using the corresponding CDF.
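One way to approximate such a CDF is by simulation, as in the sketch below. Several choices here are assumptions rather than details taken from the source: amino acids are drawn uniformly (the method may instead weight them by background frequencies), the CDF is estimated by Monte Carlo sampling rather than computed exactly, and the Kyte-Doolittle hydropathy scale stands in for Ω. The function names are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import variance

# Kyte-Doolittle hydropathy scale, used only as a stand-in for the property Ω.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def simulate_variances(n, scale, trials=5000, seed=0):
    """Sorted sample variances of n residues drawn uniformly at random (assumption)."""
    rng = random.Random(seed)
    values = list(scale.values())
    return sorted(variance(rng.choices(values, k=n)) for _ in range(trials))


def cdf_at(sorted_samples, x):
    """Empirical probability of observing a value less than or equal to x."""
    return bisect_right(sorted_samples, x) / len(sorted_samples)


# Null distribution for columns holding six residues, as in the Figure 2 example.
null_n6 = simulate_variances(6, KD)
observed = variance([KD[aa] for aa in "HTTTSS"])
p = cdf_at(null_n6, observed)
print(f"P(variance <= {observed:.3f}) is approximately {p:.4f}")
```

A separate null distribution is needed for each residue count, which mirrors the caption's remark that the CDF profile depends on the number of amino acids selected.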
Figure 4. Scheme depicting the three possible conservation cases described in the text. For each position of a multiple alignment, the significance levels corresponding to three amino acid properties of the example in Figure 2 are plotted to determine the differential conservation of the properties.
Figure 5. Scheme depicting the calculation of the differences between the variances of pairs of properties. The paired differences are largest only when one property is strictly conserved while the others are not. This is exploited to combine evidence by integrating the individual significances from different pairs of properties.
Figure 6. Typical plots of the cumulative distribution functions (CDF) of the differences of sample variances. These plots relate a calculated difference Δ between the sample variances of two properties, for any combination of amino acids, to the associated probability of obtaining such a difference by choosing amino acids at random. The CDF profiles vary depending on the number of amino acids selected. Continuing with the example of Figure 2, a calculated difference Δ between the sample variances of two properties at position 9, which holds 6 amino acids, has an associated probability that can be found using the corresponding CDF. The significance level of the null hypothesis is therefore one minus that probability.
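The test described in the Figure 6 caption can be sketched along the same lines. This is a hedged illustration under stated assumptions: the two properties are the Kyte-Doolittle hydropathy scale and the average residue mass, both standardized to zero mean and unit spread so that their variances are comparable (the source does not say how properties are scaled), amino acids are drawn uniformly, and the null CDF is estimated by simulation. All function names are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import mean, pstdev, variance

HYDROPATHY = {  # Kyte-Doolittle scale
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
RESIDUE_MASS = {  # average residue masses in daltons
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "Q": 128.13, "E": 129.12, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.10, "W": 186.21, "Y": 163.18, "V": 99.13,
}


def standardize(scale):
    """Rescale a property to zero mean and unit spread over the 20 residues (assumption)."""
    mu, sd = mean(scale.values()), pstdev(scale.values())
    return {aa: (v - mu) / sd for aa, v in scale.items()}


def abs_variance_difference(residues, scale_a, scale_b):
    """Absolute difference between the sample variances of two properties."""
    var_a = variance([scale_a[aa] for aa in residues])
    var_b = variance([scale_b[aa] for aa in residues])
    return abs(var_a - var_b)


def difference_significance(residues, scale_a, scale_b, trials=5000, seed=0):
    """Return (observed difference, 1 - empirical CDF at the observed difference)."""
    rng = random.Random(seed)
    alphabet = sorted(scale_a)
    n = len(residues)
    null = sorted(
        abs_variance_difference(rng.choices(alphabet, k=n), scale_a, scale_b)
        for _ in range(trials)
    )
    observed = abs_variance_difference(residues, scale_a, scale_b)
    return observed, 1.0 - bisect_right(null, observed) / trials


prop_a, prop_b = standardize(HYDROPATHY), standardize(RESIDUE_MASS)
delta, significance = difference_significance("HTTTSS", prop_a, prop_b)
print(f"difference = {delta:.3f}, significance of the null = {significance:.3f}")
```

A column of identical residues gives a difference of zero and hence a significance close to one, so the null hypothesis of no differential conservation is not rejected in that case.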
Figure 7. Scheme depicting the posterior probability density function (PDF) of the values of a property Ω. Continuing with the example of Figure 2, given the group of amino acids already known to be present at a position of the multiple alignment, the value assigned to a candidate amino acid aa for the property Ω, and the average of Ω at that position, the shaded area represents the probability that aa could be present at that position according to the property Ω.
Figure 8. Scree plot of the probabilities. The probabilities are sorted from highest (left) to lowest (right). A fall contrast or scree criterion is applied to this plot to identify a cut point in the curve (dashed line): the highest probability factors are chosen up to the point where the curve becomes approximately horizontal.
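A scree cut is usually chosen by eye, but a simple automatic stand-in is sketched below. It is not the criterion used in the source: it places the cut at the single largest drop between consecutive sorted probabilities, which is only one of several possible heuristics. The function name `scree_cut` is hypothetical.

```python
def scree_cut(probabilities):
    """Return the probabilities that lie before the largest drop in the sorted curve."""
    ranked = sorted(probabilities, reverse=True)  # highest first, as in a scree plot
    if len(ranked) < 2:
        return ranked
    # Drop between each value and the next one in the ranking.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Index of the first largest drop; everything up to it is retained.
    cut = max(range(len(drops)), key=drops.__getitem__)
    return ranked[: cut + 1]


selected = scree_cut([0.15, 0.9, 0.2, 0.85, 0.1, 0.8])
print(selected)
```

When the curve has no pronounced step, this heuristic still returns a cut, so a visual check of the plot remains advisable.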
Table 2. Analyzed endoglucanases belonging to glycosyl hydrolase family 16
| # | Swiss-Prot code | EMBL or GenBank code | Description | Organism | Enzymatic classification |
|---|---|---|---|---|---|
| 1 | | | β-1,3-glucanase II (BglII) | | Laminarinase |
| 2 | | | β-1,3-glucanase IIa (BglIIa) | | Laminarinase |
| 3 | | | Laminarinase | | Laminarinase |
| 4 | | | Laminarinase | | Laminarinase |
| 5 | | | endo-β-1,3-glucanase (precursor) | | Putative laminarinase |
| 6 | | | Laminarinase | | Laminarinase |
| 7 | | U04836 | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 8 | Q45095 | | β-1,3-glucanase bglH (precursor) | | Putative lichenase |
| 9 | | M34503 | Glucan endo-1,3-β-glucosidase A1 (precursor) | | Laminarinase (EC 3.2.1.39) |
| 10 | | | endo-1,3-1,4-β-glucanase eglC | | Putative lichenase |
| 11 | | | endo-1,3-1,4-β-glucanase exsH | | Putative lichenase |
| 12 | | M15674 | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 13 | | | endo-β-1,3-1,4-glucanase | | Putative lichenase |
| 14 | | | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 15 | - | CAA81096 ( | hybrid endo-1,3-1,4-β-glucanase (synthetic construct) | | Putative lichenase |
| 16 | | X57279 | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 17 | - | CAA81092 ( | hybrid endo-1,3-1,4-β-glucanase (synthetic construct) | | Putative lichenase |
| 18 | - | CAA81094 ( | hybrid endo-1,3-1,4-β-glucanase (synthetic construct) | | Putative lichenase |
| 19 | | X57094 | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 20 | | | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 21 | | X63355 | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 22 | | | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 23 | | M84339 | β-glucanase (precursor) | | Lichenase (EC 3.2.1.73) |
| 24 | | | Endo-1,3(4)-β-glucanase | | Laminarinase (EC 3.2.1.6) |
| 25 | | | β-1,3-glucanase | | Putative laminarinase |
| 26 | - | | β-1,3-glucanase | | Putative laminarinase |
Descriptor codes in the multiple alignment are highlighted in bold characters.
Figure 9. Auxiliary plots for redundancy removal in the analyzed protein family. The top dendrogram shows the clustering of proteins according to the distance (similarity percentage) between them. The bottom plot is an agglomeration distance plot of the top dendrogram. In both plots, a horizontal line marks the similarity percentage of 86%, which was taken as the limit above which two proteins were considered identical. This value is the minimal value within the most pronounced step in the agglomeration distance plot. The protein numbers are the order numbers assigned in Table 2.
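The redundancy filter described in the Figure 9 caption can be approximated with the sketch below. It is a simplification under stated assumptions: similarity is taken as percent identity over columns where neither sequence has a gap, sequences are grouped by single linkage (the source does not state which linkage its dendrogram uses), and the first sequence of each group is kept as its representative. The sequences and names in the example are invented for illustration.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def remove_redundancy(aligned, threshold=86.0):
    """Keep one representative per group of sequences more similar than the threshold."""
    names = list(aligned)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single-linkage grouping: merge any pair above the similarity limit.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if percent_identity(aligned[a], aligned[b]) > threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return [members[0] for members in clusters.values()]


# Toy aligned sequences (hypothetical, not taken from Table 2).
toy = {"p1": "ACDEFGHIKL", "p2": "ACDEFGHIKV", "p3": "MNPQRSTVWY"}
kept = remove_redundancy(toy)
print(kept)
```

Here `p1` and `p2` are 90% identical and collapse into one group, while `p3` remains on its own.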
Figure 10. Result plots for the global significance of the positions (top) and the significances of variances in each component (bottom). NLSDV: negative of the base-10 logarithm of the significance of the difference of variances; NLSV: negative of the base-10 logarithm of the significance of variances.
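Both plotted quantities are the same transform applied to different significances, so one small helper suffices, as sketched here. The floor that guards against a significance of exactly zero is an assumption added for numerical safety and is not described in the source; the function name is hypothetical.

```python
import math


def neg_log10(significance, floor=1e-300):
    """Negative base-10 logarithm of a significance, as used for NLSV and NLSDV."""
    if not 0.0 <= significance <= 1.0:
        raise ValueError("a significance must lie between 0 and 1")
    # The floor avoids log10(0); adding 0.0 turns a negative zero into zero.
    return -math.log10(max(significance, floor)) + 0.0


for s in (1.0, 0.05, 0.001):
    print(s, round(neg_log10(s), 3))
```

On this scale, larger values correspond to smaller significances, so strongly conserved positions appear as peaks in the plots.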
Figure 11. Mapping of the amino acids onto the 3D structure of BglII. Mutagenically interesting positions (light grey) are mapped over the molecular structure of the catalytic domain of BglII, selected as a representative structure of family 16 glycosyl hydrolases (order number 1 in Table 2). This figure is a cross-eyed stereogram.
Figure 12. Result plots for the amino acids that form the active site of BglII. Global significance of positions and significances of variances for each component, for positions in the multiple alignment corresponding to amino acids that form the active site of BglII (and other family 16 glycosyl hydrolases): (a) positions 20-60; (b) positions 160-200; (c) positions 200-240; and (d) positions 290-330.
Figure 13. Comparative mapping of primary variable determinant positions. Primary variable determinant positions for the entire studied protein family (lichenases and laminarinases) are shown in white and black. White positions are variable determinant amino acids for both lichenases and laminarinases, while black positions are primary variable determinant positions specific to laminarinases. This figure is a cross-eyed stereogram.