
Temperature adaptation at homologous sites in proteins from nine thermophile-mesophile species pairs.

John H McDonald

Abstract

Whether particular amino acids are favored by selection at high temperatures over others has long been an open question in protein evolution. One way to approach this question is to compare homologous sites in proteins from one thermophile and a closely related mesophile; asymmetrical substitution patterns have been taken as evidence for selection favoring certain amino acids over others. However, most pairs of prokaryotic species that differ in optimum temperature also differ in genome-wide GC content, and amino acid content is known to be associated with GC content. Here, I compare homologous sites in nine thermophilic prokaryotes and their mesophilic relatives, all with complete published genome sequences. After adjusting for the effects of differing GC content with logistic regression, 139 of the 190 pairs of amino acids show significant substitutional asymmetry, evidence of widespread adaptive amino acid substitution. The patterns are fairly consistent across the nine pairs of species (after taking the effects of differing GC content into account), suggesting that much of the asymmetry results from adaptation to temperature. Some amino acids in some species pairs deviate from the overall pattern in ways indicating that adaptation to other environmental or physiological differences between the species may also play a role. The property that is best correlated with the patterns of substitutional asymmetry is transfer free energy, a measure of hydrophobicity, with more hydrophobic amino acids favored at higher temperatures. The correlation of asymmetry and hydrophobicity is fairly weak, suggesting that other properties may also be important.


Year:  2010        PMID: 20624731      PMCID: PMC2997543          DOI: 10.1093/gbe/evq017

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Introduction

Thermophilic organisms live at 50 °C to over 100 °C, temperatures that would quickly denature most proteins from mesophiles. There is considerable interest in determining what enables proteins from thermophiles to function at high temperatures, both for the practical benefit of engineering proteins for high-temperature industrial processes and as an evolutionary and biochemical puzzle. One way to investigate whether some amino acids are more favorable than others at higher temperature is to compare the overall proportions of amino acids in protein sequences from prokaryotes living at different temperatures (Cambillau and Claverie 2000; Fukuchi and Nishikawa 2001; Chakravarty and Varadarajan 2002; Singer and Hickey 2003; Berezovsky et al. 2007). An amino acid that is more abundant in species living at higher temperatures is then interpreted to be adaptive to the higher temperatures. However, a major problem with this approach is that prokaryotes vary widely in genome-wide GC content, and amino acids with GC-rich codons are generally more abundant in organisms with GC-rich genomes (Lobry 1997; Singer and Hickey 2000). There is conflicting evidence about whether genome-wide GC content shows any relationship with habitat temperature (Musto et al. 2006; Wang et al. 2006), but the strong association of GC content and amino acid abundance will obscure any relationship between temperature and amino acid abundance if the variation in GC content is ignored. The effects of temperature and GC content can be separated using multivariate statistical techniques, such as principal component analysis (Kreil and Ouzounis 2001; Saunders et al. 2003), correspondence analysis (Tekaia et al. 2002; Lobry and Chessel 2003; Tekaia and Yeramian 2006; Boussau et al. 2008; Puigbò et al. 2008), and other techniques (Naya et al. 2006; Zeldovich et al. 2007). 
However, these approaches suffer from “phylogenetic pseudoreplication”; they treat multiple species from the same clade and same habitat as if they were independent samples, and it has long been known that this can cause serious statistical problems (Felsenstein 1985; Harvey and Pagel 1991). To illustrate why this is a problem, imagine biologists who were interested in temperature adaptation of terrestrial vertebrates. If those biologists surveyed vertebrates from a variety of habitats and looked for associations with temperature, they would see a higher proportion of species that shed their skin living in warmer areas. However, it would be erroneous to conclude from this that shedding skin is an adaptation to high temperature; the association would merely result from sampling large numbers of Squamata (lizards and snakes) in warm areas and few squamates in cold areas. Similarly, in studies of temperature and amino acid composition, some clades are found predominantly among thermophiles, and some are predominant among mesophiles; for example, of the 204 species studied by Zeldovich et al. (2007), 63% of the thermophiles and 5% of the mesophiles are archaea, whereas 0% of the thermophiles and 54% of the mesophiles are proteobacteria. A multivariate statistical technique that treated each species as an independent data point could produce an apparent association of particular amino acids with higher temperatures, when in reality that association might result from a difference between clades that may have nothing to do with temperature. A second form of evidence used to compare amino acid composition in mesophiles and thermophiles is substitutional asymmetry (Argos et al. 1979; Haney et al. 1999; McDonald et al. 1999). 
Protein sequences from one mesophile and one thermophile are aligned, and the observation of more aligned sites with amino acid A in the mesophile and B in the thermophile than the opposite pattern provides evidence that B is favored over A in the higher temperature organism. Because only aligned sites in homologous proteins are considered, the effect of gain or loss of proteins of different amino acid composition does not obscure the results. In addition, each mesophile–thermophile pair of species can be phylogenetically independent of others that have been compared, an important consideration when using comparative methods to infer adaptation. (To say that mesophile–thermophile pair A and B are “phylogenetically independent” of other pairs means that A and B are more closely related to each other than either is to any of the other species in the data set.) This approach has found extensive evidence for substitutional asymmetry (Haney et al. 1999; McDonald et al. 1999; McDonald 2001; Nishio et al. 2003; Mizuguchi et al. 2007), but the problem remains that for those pairs of amino acids whose codons have different GC content, overall differences in GC content between the mesophile and thermophile could still be the cause of substitutional asymmetry. Here, I use logistic regression of the proportion of substitutions in one direction versus the overall difference in GC content to predict the substitutional asymmetry in a pair of species with identical genomic GC content. This method should help determine whether amino acids that are favored at higher temperatures share biochemical properties. If substitutional asymmetry between mesophilic and thermophilic proteins results from temperature adaptation based on the fundamental biochemical properties of the amino acids, the same patterns should be found in all mesophile–thermophile comparisons after controlling for differences in GC content. 
Differences in other aspects of the environment, such as salinity, hydrostatic pressure, pH, oxygen, and nutrient source, could cause patterns of asymmetry that are unrelated to temperature and therefore different in different mesophile–thermophile pairs. In addition, biosynthetic costs of amino acids are high enough to cause selection on amino acid usage (Akashi and Gojobori 2002; Seligmann 2003; Heizer et al. 2006; Swire 2007), so organisms which differ in biosynthetic pathways, or which differ in whether they are autotrophic or heterotrophic for a particular amino acid, may have different patterns of substitutional asymmetry. A second goal of this paper is to see how consistent the patterns of substitutional asymmetry are among different species, which may help determine how much of the asymmetry is due to temperature adaptation and how much is due to other factors.

Materials and Methods

Choice of Mesophile–Thermophile Pairs

The NCBI Entrez Genome Project database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj) was searched for thermophilic archaea and bacteria (optimum growth temperature, Topt, greater than or equal to 50 °C) with complete, published genome sequences. Species from higher taxa in which all species with published genomes are thermophiles, such as Aquificae and Crenarchaeota, were excluded. The closest mesophile (Topt ≤ 40 °C) with a complete published genome sequence was identified for each thermophile using published phylogenies. Where a thermophile had more than one mesophile that was equally closely related or vice versa, the species pair was chosen with the most similar habitat, physiology, and genomic GC content. Where more than one strain of a species had been sequenced, the strain with the earliest published sequence was used. Nine phylogenetically independent pairs of mesophiles with thermophiles were identified (table 1); at the time the database was searched, there were no other mesophile–thermophile species pairs with published genomes that were phylogenetically independent of the nine used here.
Table 1

Species Pairs Used in This Study

Species                                 Topt (°C)  GC (%)  Genome Reference
Sulfurovum sp. NBC37-1                  33         43.8    Nakagawa et al. (2007)
Nitratiruptor sp. SB155-2               55         39.7    Nakagawa et al. (2007)
Streptomyces avermitilis                26         70.7    Omura et al. (2001)
Thermobifida fusca                      50–55      67.5    Lykidis et al. (2007)
Methanococcus maripaludis               35–40      33.1    Hendrickson et al. (2004)
Methanocaldococcus jannaschii           85         31.4    Bult et al. (1996)
Deinococcus radiodurans                 30–37      67.0    White et al. (1999)
Thermus thermophilus                    68         69.4    Henne et al. (2004)
Desulfitobacterium hafniense Y51        37         47.4    Nonaka et al. (2006)
Pelotomaculum thermopropionicum         55         53.0    Kosaka et al. (2008)
Synechocystis sp. PCC6803               26         47.7    Kaneko et al. (1996)
Thermosynechococcus elongatus           55         53.9    Nakamura et al. (2002)
Bacillus subtilis                       25–35      43.5    Kunst et al. (1997)
Geobacillus kaustophilus                60         52.1    Takami et al. (2004)
Clostridium tetani                      37         28.7    Bruggeman et al. (2003)
Thermoanaerobacter tengcongensis        75         37.6    Bao et al. (2002)
Methanosphaera stadtmanae               36–40      27.6    Fricke et al. (2006)
Methanothermobacter thermautotrophicus  65–70      49.5    Smith et al. (1997)

NOTE.—GC, GC content of the major chromosome (excluding plasmids and extrachromosomal elements). Topt and GC from the NCBI Genome Project database, except Topt for Sulfurovum and Nitratiruptor (Nakagawa et al. 2007); Desulfitobacterium (Suyama et al. 2001), Geobacillus (Takami et al. 2004), and Synechocystis (growth temperature recommended by the American Type Culture Collection). Topt, optimum growth temperature.


Identification and Alignment of Homologous Proteins

For seven of the mesophile–thermophile pairs of species, the Entrez Gene Plot function (http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi) was used to obtain a list of reciprocal best matches of protein sequences. Each list was sorted, and where a sequence from one species had multiple best matches from the other species (which can happen when there are multiple identical protein sequences), all but one of the matching pairs were deleted. Proteins encoded by small extrachromosomal elements in Methanocaldococcus jannaschii or plasmids in the other species were deleted. For the Pelotomaculum thermopropionicum versus Desulfitobacterium hafniense and Nitratiruptor versus Sulfurovum comparisons, Gene Plot was not available. I therefore used BLAST to obtain a list of the best match for each protein sequence in the other species and then sorted the two lists in a spreadsheet to identify the reciprocal best matches. No attempt was made to eliminate proteins whose genes may have been acquired recently by horizontal gene transfer (HGT). Whether a gene could be identified as acquired through HGT would depend on how divergent the source species was and whether its sequences were available; therefore, painstaking investigation of each gene would only result in eliminating some, but not all, such genes. Leaving genes acquired through HGT in the data set would tend to obscure patterns of consistent substitutional asymmetry by introducing noise into the data rather than creating patterns by statistical artifacts that would not be there otherwise. The complete set of protein sequences was downloaded from Entrez Genome for each species, and a Pascal program was written to use the list of reciprocal best matches, create a file for each pair of protein sequences, extract the protein sequences, and put them in the appropriate files. Each pair of protein sequences was aligned using ClustalW (Chenna et al. 2003), with the default parameters. 
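The reciprocal-best-match step can be illustrated with a minimal Python sketch. It assumes the best match for each protein has already been obtained (for example from a BLAST run) and stored in two dictionaries; the protein identifiers and the function name are hypothetical and are not taken from the Pascal program or spreadsheet used in the study.

```python
def reciprocal_best_matches(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best match is b and b's best match is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        # Keep the pair only if the match is confirmed in the other direction.
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical protein identifiers, for illustration only.
meso_to_thermo = {"m1": "t1", "m2": "t2", "m3": "t2"}
thermo_to_meso = {"t1": "m1", "t2": "m3", "t3": "m1"}

pairs = reciprocal_best_matches(meso_to_thermo, thermo_to_meso)
print(pairs)
```

Because each key maps to a single best match, a protein can appear in at most one reciprocal pair, which mirrors the rule of keeping only one matching pair per sequence.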
Protein pairs with fewer than 35% identical sites and proteins shorter than 20 amino acids were deleted. Ambiguously aligned sites adjacent to gaps were then omitted using the program AmbiguityRemover, with the omitted sites extending from each gap to the nearest pair of adjacent sites that were both identical in the two sequences. The number of unambiguously aligned sites exhibiting each of the 190 possible pairwise patterns of difference was then counted using the program AsymmetryCounter. Both programs are available for download from http://udel.edu/~mcdonald/asymmetry.html.
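As a rough illustration of the counting step, the following Python sketch tallies the direction of each difference in one aligned protein pair. It is not the AsymmetryCounter program: the function names and the toy sequences are hypothetical, the definition of percent identity (identical columns over ungapped columns) is an assumption, and the removal of ambiguously aligned sites next to gaps is not reproduced.

```python
from collections import Counter


def percent_identity(seq_a, seq_b):
    """Percent of ungapped alignment columns with identical residues (assumed definition)."""
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    same = sum(1 for a, b in columns if a == b)
    return 100.0 * same / len(columns)


def count_differences(meso, thermo):
    """Count (mesophile residue, thermophile residue) at ungapped columns that differ."""
    counts = Counter()
    for m, t in zip(meso, thermo):
        if m != "-" and t != "-" and m != t:
            counts[(m, t)] += 1
    return counts


def asymmetry(counts, a, b):
    """Proportion of a<->b sites with a in the mesophile and b in the thermophile."""
    forward = counts[(a, b)]
    reverse = counts[(b, a)]
    total = forward + reverse
    return forward / total if total else None


# Toy aligned sequences, invented for illustration.
meso = "MKHSLHAY-G"
thermo = "MKYSIYAHQG"

counts = count_differences(meso, thermo)
print(percent_identity(meso, thermo))
print(counts)
print(asymmetry(counts, "H", "Y"))
```

In a full analysis the counts from every retained protein pair would be summed before the asymmetry for each of the 190 amino acid pairs is calculated.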

Statistical Analysis

For each pair of amino acids in each pair of species, the exact binomial test (for N < 1,000; McDonald 2009, p. 24–32) or G-test of goodness-of-fit (for N > 1,000; McDonald 2009, p. 46–51) was used to test the significance of the deviation from the expected 1:1 ratio. To distinguish between asymmetry resulting from genomic GC differences and asymmetry due to other causes, the LOGISTIC procedure of SAS (SAS Institute 2009) was used to perform logistic regression for each pair of amino acids, with the difference in genomic GC content between the thermophile and the mesophile as the independent variable and the proportion of substitutions in one direction as the dependent variable. Logistic regression (McDonald 2009, p. 247–255) finds the best-fitting equation of the form ln[Y/(1 − Y)] = a + bX, where Y is the probability of obtaining a particular value of a nominal variable for a given value of the measurement variable, a is the intercept, b is the slope, and X is the value of the measurement variable. For example, the logistic regression equation for the amino acids histidine and tyrosine (fig. 1) predicts the probability (Y) that a histidine/tyrosine site has histidine in the mesophile and tyrosine in the thermophile for any value of X, the difference in GC content between two species. The significance of the slope was used to test whether there was a significant relationship between the difference in GC content and the pattern of asymmetry. The significance of the intercept was used to test whether the predicted asymmetry for a mesophile–thermophile pair with equal GC contents was significantly different from the 1:1 ratio expected under the neutral model of molecular evolution.
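The two tests of the 1:1 ratio can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually run for the paper: the function names are hypothetical, the two-tailed exact P value is obtained by doubling the smaller tail (valid because the null proportion is 0.5), no continuity or Williams correction is applied to the G-test, and the handling of N exactly equal to 1,000 is a guess because the text does not specify it.

```python
import math


def exact_binomial_p(k, n):
    """Two-tailed exact binomial P value for k of n under a 1:1 expectation."""
    small = min(k, n - k)
    tail = sum(math.comb(n, i) for i in range(small + 1))
    # The null distribution is symmetric at p = 0.5, so double the smaller tail.
    return min(1.0, 2 * tail / 2 ** n)


def g_test(k, n):
    """G-test of goodness-of-fit to 1:1; returns (G, P) with 1 degree of freedom."""
    expected = n / 2
    g = 0.0
    for observed in (k, n - k):
        if observed > 0:
            g += 2 * observed * math.log(observed / expected)
    # Chi-square survival function with 1 d.f. is erfc(sqrt(x / 2)).
    return g, math.erfc(math.sqrt(max(g, 0.0) / 2))


def asymmetry_p_value(k, n):
    """Choose the test by sample size (N == 1000 is sent to the G-test by assumption)."""
    if n < 1000:
        return exact_binomial_p(k, n)
    return g_test(k, n)[1]


print(exact_binomial_p(2, 10))
print(g_test(700, 1200))
```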
Fig. 1

Example of logistic regression of substitutional asymmetry and difference in GC content. GCtherm − GCmeso, the percent difference in GC content between the thermophile and the mesophile in each species pair. Hmeso → Ythermo, the proportion of sites in each species pair that have histidine in the mesophile and tyrosine in the thermophile, as a proportion of all aligned sites that have histidine in one species and tyrosine in the other. Error bars are 95% confidence intervals of the binomial proportion. The solid line is the logistic regression line, given by solving ln[Y/(1 − Y)] = a + bX for Y, where Y is Hmeso → Ythermo, X is GCtherm − GCmeso, a is the intercept, and b is the slope. The dashed line shows the estimation of the expected asymmetry in a species pair with zero difference in GC content.

To identify amino acids that deviated from the overall pattern in particular species pairs, the residual (difference between the observed proportion of substitutions in one direction and the proportion predicted by the logistic regression model) was calculated for each amino acid pair in each species pair and then averaged across the 19 pairs involving each amino acid. For this analysis, the proportion of sites with the target amino acid in the thermophile and the other amino acid in the mesophile was used.
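The regression and residual calculations described above can be sketched without a statistics package. The code below fits ln[Y/(1 − Y)] = a + bX to grouped counts by Newton-Raphson, predicts the asymmetry at X = 0, and computes residuals. It is a hedged illustration, not a substitute for the SAS LOGISTIC procedure: the function names are hypothetical, the significance tests for slope and intercept are omitted, and the substitution counts are invented. Only the GC differences are real, taken as thermophile minus mesophile from table 1.

```python
import math


def expit(eta):
    """Inverse of the logit: solves ln[Y/(1 - Y)] = eta for Y."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(x, successes, totals, iterations=50):
    """Maximum-likelihood fit of logit(Y) = a + b*x to grouped binomial counts."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ki, ni in zip(x, successes, totals):
            p = expit(a + b * xi)
            w = ni * p * (1.0 - p)
            g0 += ki - ni * p            # gradient with respect to a
            g1 += (ki - ni * p) * xi     # gradient with respect to b
            h00 += w                     # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b


def predict(a, b, x):
    """Predicted proportion of substitutions in the modelled direction."""
    return expit(a + b * x)


def residuals(x, successes, totals, a, b):
    """Observed proportion minus predicted proportion for each species pair."""
    return [k / n - predict(a, b, xi) for xi, k, n in zip(x, successes, totals)]


# GC(thermophile) - GC(mesophile) for the nine species pairs in table 1.
gc_diff = [-4.1, -3.2, -1.7, 2.4, 5.6, 6.2, 8.6, 8.9, 21.9]
# Invented counts of sites with H in the mesophile and Y in the thermophile.
h_to_y = [70, 66, 64, 60, 58, 55, 52, 54, 40]
totals = [100] * 9

a, b = fit_logistic(gc_diff, h_to_y, totals)
print("intercept:", a, "slope:", b)
print("predicted asymmetry at zero GC difference:", predict(a, b, 0.0))
print("residuals:", residuals(gc_diff, h_to_y, totals, a, b))
```

Averaging the residuals for one amino acid over its 19 partner amino acids, within a single species pair, gives the quantity described in the text.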

Amino Acid Properties

The logistic regression equation for each pair of amino acids was used to predict the expected proportion of substitutions in each direction in a hypothetical species pair that did not differ in GC content. These predicted proportions were multiplied by the total number of substitutions across the nine species pairs for that amino acid pair to yield a synthetic data set. The AAindex list of amino acid indices (Kawashima et al. 2008) was downloaded from http://www.genome.ad.jp/dbget/aaindex.html. Indexes that measure the propensity of amino acids to occur in particular proteins or parts of proteins were deleted, as were those with missing or estimated values. For each index, the difference between the values of the index for each pair of amino acids was used as the independent variable in a simple logistic regression. The dependent variable was taken from the synthetic data set, the expected number of substitutions in each direction in a species pair that does not differ in GC content.
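The construction of the synthetic data set amounts to one multiplication per amino acid pair, sketched below. The predicted proportion for histidine/tyrosine (0.632) is taken from table 2; the total of 10,000 substitutions is an invented placeholder, and the function name is hypothetical.

```python
import math


def synthetic_counts(intercept, total_substitutions):
    """Expected counts in each direction for a species pair with zero GC difference."""
    # At X = 0 the logistic equation reduces to ln[Y/(1 - Y)] = a.
    proportion = 1.0 / (1.0 + math.exp(-intercept))
    forward = proportion * total_substitutions
    return forward, total_substitutions - forward


# Intercept corresponding to the predicted H -> Y proportion of 0.632 (table 2).
hy_intercept = math.log(0.632 / (1 - 0.632))
# Invented total number of H <-> Y sites across the nine species pairs.
forward, reverse = synthetic_counts(hy_intercept, 10_000)
print(forward, reverse)
```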

Results

Extensive Substitutional Asymmetry Related to Difference in GC Content

There is extensive substitutional asymmetry; of the 1,710 total comparisons (190 pairs of amino acids in nine species pairs), 1,038 are significantly (P < 0.05) different from the expected 1:1 ratio (supplementary table 1, Supplementary Material online). Each of the 190 pairs of amino acids is significantly asymmetrical in at least one of the nine species pairs, and 125 of the pairs of amino acids are asymmetrical in at least five species pairs. Some of the asymmetry is associated with differences in GC content. Of the 190 pairs of amino acids, 153 differ in average GC content of their codons (e.g., histidine [H] has an average of 1.5 GC in its codons [CAC, CAT] vs. tyrosine [Y], which has an average of 0.5 GC in its codons [TAC, TAT]). The logistic regression of substitutional asymmetry versus difference in genome-wide GC content has a significant slope for 122 out of these 153 pairs of amino acids (supplementary table 2, Supplementary Material online), indicating that the proportion of substitutions in each direction depends on the difference in genome-wide GC content. Figure 1 shows an example of this; the proportion of H ↔ Y sites with H in the mesophile and Y in the thermophile decreases for species pairs in which the thermophile has greater GC than the mesophile. Of the 37 amino acid pairs with no difference in average GC content of their codons, 15 have a significant slope. Of the 122 pairs of amino acids with differing average GC content and significant slopes, 114 are in the expected direction: sites with the GC-rich amino acid in the mesophile and the GC-poor amino acid in the thermophile become less common in the species pairs where the thermophile has higher genome-wide GC content than the mesophile (supplementary table 2, Supplementary Material online). Seven of the eight pairs of amino acids that show the opposite pattern involve methionine. 
Sites with aspartic acid, cysteine, glutamic acid, glutamine, leucine, serine, or threonine in the mesophile and methionine in the thermophile become more common as the thermophile–mesophile GC difference increases, even though the methionine codon has a slightly smaller GC content than the codons for the other amino acids. The logistic regression for 139 out of 190 pairs of amino acids had a significant intercept (supplementary table 2, Supplementary Material online), meaning that a mesophile–thermophile species pair with no difference in genomic GC content would be expected to have significant asymmetry. The intercept of each logistic regression was used to estimate the substitutional asymmetry predicted for a mesophile–thermophile pair with no difference in GC content (table 2). The average of the 19 intercepts for each amino acid gives a measure of how strongly that amino acid is preferred in mesophiles or thermophiles; for example, only 41.6% of the substitutions involving serine would have serine in the thermophile and some other amino acid in the mesophile (table 3).
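The codon GC averages quoted above can be recomputed from the genetic code. The Python sketch below assumes the standard genetic code, written in the usual TCAG codon order; the function name is hypothetical. It gives 1.5 for histidine and 0.5 for tyrosine, and the number of amino acid pairs with differing averages can be compared with the 153 reported in the text.

```python
from fractions import Fraction
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def average_codon_gc():
    """Average number of G or C bases per codon for each amino acid."""
    gc_counts = {}
    for codon, aa in zip(product(BASES, repeat=3), AMINO):
        if aa == "*":
            continue  # skip stop codons
        gc = sum(base in "GC" for base in codon)
        gc_counts.setdefault(aa, []).append(gc)
    # Exact fractions avoid spurious inequalities from floating-point rounding.
    return {aa: Fraction(sum(v), len(v)) for aa, v in gc_counts.items()}


avg_gc = average_codon_gc()
differing = sum(1 for a, b in combinations(sorted(avg_gc), 2) if avg_gc[a] != avg_gc[b])

print(float(avg_gc["H"]), float(avg_gc["Y"]))
print(differing, "of 190 pairs differ in average codon GC content")
```

The same table shows why the methionine results run against expectation: the single methionine codon contains one G or C, fewer than the averages for the seven amino acids named above.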
Table 2

The Substitutional Asymmetry Predicted for a Mesophile–Thermophile Pair with No Difference in GC Content, Based on the Intercept of the Logistic Regression of Asymmetry Versus Difference in GC Content

SN 0.508     DG 0.509     GC 0.529     HK 0.517     KP 0.558*
SD 0.521*    DQ 0.558*    GV 0.553*    HC 0.537     KY 0.547*
ST 0.542*    DM 0.565*    GI 0.546*    HV 0.554*    CV 0.579*
SG 0.482*    DH 0.578*    GF 0.564*    HI 0.578*    CI 0.504
SQ 0.561*    DE 0.574*    GL 0.589*    HF 0.581*    CF 0.526
SM 0.569*    DA 0.547*    GR 0.608*    HL 0.554*    CL 0.539*
SH 0.562*    DK 0.561*    GW 0.555     HR 0.559*    CR 0.446*
SE 0.582*    DC 0.576*    GP 0.601*    HW 0.598*    CW 0.487
SA 0.593*    DV 0.565*    GY 0.603*    HP 0.591*    CP 0.492
SK 0.603*    DI 0.538     QM 0.536*    HY 0.632*    CY 0.520
SC 0.590*    DF 0.619*    QH 0.556*    EA 0.479*    VI 0.507*
SV 0.610*    DL 0.571*    QE 0.518*    EK 0.513*    VF 0.502
SI 0.609*    DR 0.622*    QA 0.511     EC 0.554     VL 0.511*
SF 0.607*    DW 0.693*    QK 0.516*    EV 0.505     VR 0.555*
SL 0.610*    DP 0.618*    QC 0.598*    EI 0.515     VW 0.512
SR 0.624*    DY 0.653*    QV 0.538*    EF 0.542*    VP 0.540*
SW 0.641*    TG 0.460*    QI 0.584*    EL 0.518*    VY 0.538*
SP 0.604*    TQ 0.521*    QF 0.588*    ER 0.550*    IF 0.490
SY 0.676*    TM 0.511     QL 0.581*    EW 0.571*    IL 0.522*
ND 0.500     TH 0.554*    QR 0.579*    EP 0.566*    IR 0.536*
NT 0.546*    TE 0.553*    QW 0.631*    EY 0.574*    IW 0.506
NG 0.502     TA 0.545*    QP 0.610*    AK 0.520*    IP 0.521
NQ 0.549*    TK 0.563*    QY 0.644*    AC 0.453*    IY 0.529*
NM 0.576*    TC 0.523     MH 0.500     AV 0.522*    FL 0.512*
NH 0.611*    TV 0.595*    ME 0.522     AI 0.500     FR 0.526
NE 0.545*    TI 0.607*    MA 0.517     AF 0.526*    FW 0.498
NA 0.552*    TF 0.580*    MK 0.540*    AL 0.536*    FP 0.517
NK 0.587*    TL 0.591*    MC 0.537     AR 0.571*    FY 0.500
NC 0.594*    TR 0.607*    MV 0.574*    AW 0.525     LR 0.513
NV 0.629*    TW 0.604*    MI 0.583*    AP 0.605*    LW 0.515
NI 0.612*    TP 0.611*    MF 0.596*    AY 0.546*    LP 0.505
NF 0.593*    TY 0.619*    ML 0.607*    KC 0.532     LY 0.508
NL 0.626*    GQ 0.529*    MR 0.556*    KV 0.487     RW 0.576*
NR 0.650*    GM 0.514     MW 0.569*    KI 0.503     RP 0.503
NW 0.624*    GH 0.548*    MP 0.628*    KF 0.541*    RY 0.549*
NP 0.604*    GE 0.544*    MY 0.617*    KL 0.501     WP 0.551
NY 0.685*    GA 0.561*    HE 0.493     KR 0.599*    WY 0.495
DT 0.508     GK 0.543*    HA 0.494     KW 0.623*    PY 0.482

NOTE.—The number is the predicted proportion of sites with the first amino acid in the mesophile and the second amino acid in the thermophile; an asterisk indicates that the proportion is significantly different from 0.50 (P < 0.05). Amino acids are ordered from least preferred (serine, S) to most preferred (tyrosine, Y) in thermophiles.

Table 3

Average Asymmetry and Transfer Free Energy for Each Amino Acid

Amino Acid         Average Asymmetry  Transfer Free Energy
Serine (S)         0.416              0.04
Asparagine (N)     0.417              −0.01
Aspartic acid (D)  0.430              0.54
Threonine (T)      0.450              0.44
Glycine (G)        0.451              0.00
Glutamine (Q)      0.459              −0.10
Methionine (M)     0.470              1.30
Histidine (H)      0.485              1.10
Glutamic acid (E)  0.497              0.55
Alanine (A)        0.500              0.73
Lysine (K)         0.504              1.50
Cysteine (C)       0.523              0.70
Valine (V)         0.529              1.69
Isoleucine (I)     0.531              2.97
Phenylalanine (F)  0.542              2.65
Leucine (L)        0.544              2.49
Arginine (R)       0.551              0.73
Tryptophan (W)     0.562              3.00
Proline (P)        0.565              2.60
Tyrosine (Y)       0.575              2.97

NOTE.—Average asymmetry is the predicted proportion, in a pair of species with equal GC contents, of substitutions from other amino acids in the mesophile to the given amino acid in the thermophile. Transfer free energy is from Simon (1976). Amino acids are ordered from least preferred (serine) to most preferred (tyrosine) in thermophiles.
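The link between table 2 and table 3 can be checked for serine with a few lines of Python. The sketch assumes the average is taken over predicted proportions rather than over the intercepts on the logit scale (the text does not say which). Under that assumption, the 19 serine entries of table 2 reproduce the 0.416 listed for serine in table 3. The variable names are hypothetical.

```python
# Table 2 entries with serine listed first: predicted proportion of sites with
# serine in the mesophile and the other amino acid in the thermophile.
serine_first = {
    "N": 0.508, "D": 0.521, "T": 0.542, "G": 0.482, "Q": 0.561,
    "M": 0.569, "H": 0.562, "E": 0.582, "A": 0.593, "K": 0.603,
    "C": 0.590, "V": 0.610, "I": 0.609, "F": 0.607, "L": 0.610,
    "R": 0.624, "W": 0.641, "P": 0.604, "Y": 0.676,
}

# Table 3 reports the opposite direction (serine in the thermophile),
# so each proportion is subtracted from 1 before averaging.
serine_average = sum(1 - p for p in serine_first.values()) / len(serine_first)
print(round(serine_average, 3))
```

For an amino acid that appears second in some table 2 pairs, the listed proportion already has that amino acid in the thermophile and would be used without subtraction.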


Consistency among Pairs of Species

The residual (the difference between the observed asymmetry and that predicted by the logistic regression) was calculated for each pair of amino acids in each species pair, and the average residual was calculated for each amino acid in each species pair. In some species pairs, the average residual for some amino acids is quite a bit larger or smaller than expected (fig. 2). For example, in the Streptomyces–Thermobifida species pair, there are fewer sites with lysine (K) in the thermophile and other amino acids in the mesophile than predicted by the logistic regression, whereas there are more such sites than predicted in the Deinococcus–Thermus species pair. Out of 180 average residuals (20 amino acids in nine species pairs), 98 have a 95% confidence interval that does not include 0.
Fig. 2

Mean of the 19 residuals (differences between the observed number of substitutions and that expected from the logistic regression) for each amino acid in each species pair. Values above 0 indicate that sites with that amino acid in the thermophile and other amino acids in the mesophile are more common than expected from the logistic regression of all species. Error bars are 95% confidence intervals.

After removing indices with missing or estimated values, and indices that represent frequencies in different parts of proteins, the AAindex database (Kawashima et al. 2008) contains 238 measures of biochemical and physical properties of amino acids. Treating the difference in each index for each of the pairs of amino acids as 190 values causes all kinds of statistical problems with nonindependence, so the results of the logistic regression of substitutional asymmetry versus index differences should be viewed as an exercise in data exploration not hypothesis testing. The strongest relationship between the difference in amino acid index and the predicted substitutional asymmetry is with transfer free energy (Simon 1976), a measure of hydrophobicity. In general, amino acids with higher transfer free energy tend to be substituted at high temperatures for amino acids with lower transfer free energy (fig. 3). However, differences in transfer free energy do not explain all the substitutional asymmetry. Of 139 pairs of amino acids with a significant intercept in the logistic regression (meaning that the substitutional asymmetry is predicted to be significant for a mesophile–thermophile pair with no difference in genome-wide GC content), 14 have the opposite pattern: the amino acid with lower transfer free energy is found more often at higher temperatures. The next strongest associations are with several other measures of hydrophobicity (Zimmerman et al. 1968; Jones 1975; Argos et al. 1982; Takano and Yutani 2001), all of which are highly correlated with transfer free energy.
Fig. 3

Substitutional asymmetry (proportion of all A ↔ B sites that have A in the mesophile and B in the thermophile) versus the difference in transfer free energy of the amino acids (B-A), where B is the amino acid with greater transfer free energy.

Discussion

Each of the nine mesophile–thermophile species pairs exhibits a large amount of substitutional asymmetry; for most pairs of amino acids, there are more homologous sites with one amino acid in the mesophile and the other amino acid in the thermophile than the opposite. Substitutional asymmetry has been previously observed in small numbers of proteins from Methanococcus versus Methanocaldococcus (Haney et al. 1999; McDonald et al. 1999), Bacillus versus Geobacillus (McDonald et al. 1999), and Deinococcus versus Thermus (McDonald 2001). Here, I use translated protein sequences from the entire genomes of these species pairs and add six additional mesophile–thermophile pairs from a broad variety of habitats. Differences in genome-wide GC contents are one cause of substitutional asymmetry; all the species pairs used here differ to some degree in GC content, and it has long been known that amino acids with GC-rich codons are more common in species with GC-rich genomes (Lobry 1997; Singer and Hickey 2000). It is not clear whether differences in genome-wide GC content are caused by selection or mutational bias (Rocha and Danchin 2002; Lind and Andersson 2008), and it is not clear to what extent increased habitat temperatures cause increased GC contents (Musto et al. 2006; Wang et al. 2006). What is clear is that any attempt to identify selection on amino acids as a cause of substitutional asymmetry must remove the effects of GC content. Here, logistic regression modeling is used to control statistically for the effects of differing GC content, with the difference in GC content as the independent variable and the direction of substitution as the dependent variable. For the majority of amino acid pairs, the logistic regression predicts that a mesophile–thermophile pair of species that did not differ in GC content would exhibit extensive substitutional asymmetry. 
The significant intercepts in the logistic regression models mean that the preferences for one amino acid over another are fairly consistent across the nine pairs of species. Substitutional asymmetry in one mesophile–thermophile pair could be caused by any number of habitat differences; for example, the mesophile Methanococcus maripaludis was isolated from a salt marsh (Jones, Paynter, and Gupta 1983), whereas the thermophile M. jannaschii was originally isolated from a deep-sea vent 2,600 m below the ocean surface (Jones, Leigh, et al. 1983). A difference in hydrostatic pressure may favor some amino acids over others (Di Giulio 2005); if hydrostatic pressure were an important selective factor, M. maripaludis and M. jannaschii would have patterns of asymmetry different from the other mesophile–thermophile pairs, which do not differ in the hydrostatic pressure of their habitats. The consistency of the patterns of asymmetry across species pairs suggests that much of the asymmetry results from selection caused by the different habitat temperatures. Although the patterns of asymmetry are consistent enough across species pairs to produce logistic regression models with significant intercepts, the amounts of asymmetry in each species pair are not exactly as predicted by the logistic regression; many amino acids are favored more or less strongly in some species pairs than would be expected. The optimal temperatures of the species pairs differ by different amounts, from 15 to 55 °C, so it would have been startling if they all exhibited the exact same amount of asymmetry. The species pairs differ in how recently they diverged from a common ancestor, and the species pairs also vary in other aspects that may affect selection on amino acid use: aerobic versus anaerobic; autotrophic versus heterotrophic; marine, freshwater, and terrestrial; and deep sea versus shallow water. 
Species pairs in which the ancestral species was thermophilic and one lineage then adapted to lower temperatures may show different patterns of temperature adaptation than species pairs in which the ancestor was mesophilic and one lineage adapted to higher temperatures (Berezovsky and Shakhnovich 2005). There is also increasing evidence that biosynthetic costs may affect amino acid use (Akashi and Gojobori 2002; Seligmann 2003; Heizer et al. 2006; Swire 2007), and the costs of particular amino acids will depend on factors that may be unrelated to temperature, such as the biosynthetic pathways used (for autotrophs) and environmental availability and uptake costs (for heterotrophs). Including all the possibly relevant variables when there are only nine species pairs would result in a logistic model with far more parameters than the data can support, and therefore with many spurious correlations; separating the substitutional asymmetry caused by temperature adaptation from the asymmetry resulting from other causes will require examining the genomes of a much larger number of mesophile–thermophile species pairs than are currently available.
These results show that amino acids with greater hydrophobicity (higher transfer free energy) tend to be preferred in thermophiles, which is consistent with several earlier studies (Argos et al. 1979; Gromiha, Oobatake, Kono, et al. 1999; Haney et al. 1999; Tekaia et al. 2002; Nakashima et al. 2003; Sadeghi et al. 2006; Berezovsky et al. 2007). There are, however, numerous exceptions to this rule. The existence of these exceptions is consistent with previous research that has failed to identify a single physicochemical property of the amino acids that would explain all the differences in amino acid abundance between mesophiles and thermophiles (Böhm and Jaenicke 1994; Zhou et al. 2008). One possible explanation is that thermal adaptation of amino acids is based on complicated tradeoffs between different properties (Gromiha, Oobatake, and Sarai 1999).
Another possibility is that the cost of synthesizing amino acids plays a major role; the relative synthesis costs of amino acids change as temperatures increase (Amend and Shock 1998), and amino acids with lower synthesis costs tend to be more abundant, even in heterotrophs (Swire 2007). Values for the cost of synthesis of each amino acid in each species at a variety of temperatures are not available; as this information accumulates, it may become possible to understand the role that relative biosynthetic costs of amino acids play in temperature adaptation of proteins.
There are numerous reports of charged amino acids being more common in thermophiles than in mesophiles (Cambillau and Claverie 2000; Das and Gerstein 2000; Szilágyi and Závodszky 2000; Fukuchi and Nishikawa 2001; Vieille and Zeikus 2001; Chakravarty and Varadarajan 2002; Tekaia et al. 2002; Nakashima et al. 2003; Suhre and Claverie 2003; Sadeghi et al. 2006; Berezovsky et al. 2007). That pattern is not apparent here; of 47 significant intercepts in the logistic regression involving one charged amino acid (arginine, aspartic acid, glutamic acid, or lysine) and one noncharged amino acid, 24 have the charged amino acid becoming more common in the thermophiles, but 23 have the charged amino acid becoming less common in the thermophiles (table 2). If histidine, which is weakly charged at physiological pH, is included in the charged amino acids, the result is the same: Of 57 significant intercepts, 28 have the charged amino acid becoming more common in the thermophiles, but 29 have the charged amino acid becoming less common in the thermophiles. Most of the studies reporting increased proportions of charged amino acids in thermophiles have relied heavily on hyperthermophiles, which have optimum growth temperatures of 85 °C to >100 °C, whereas the nine species pairs used here include only one hyperthermophile, M. jannaschii, with an optimum growth temperature of 85 °C. It may be that increasing the overall proportion of charged amino acids is only an important adaptation at very high temperatures.
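How evenly the charged-versus-noncharged intercepts are split can be checked with an exact two-sided sign test against a 50:50 expectation. The sketch below uses only the counts quoted above (24 of 47, and 28 of 57 when histidine is included). The function name is hypothetical, and the test treats the intercepts as independent, which is a simplifying assumption because the same amino acids appear in many pairs; it is offered as an illustration, not as a test reported in the paper.

```python
from math import comb

def two_sided_sign_test(k, n):
    """Exact two-sided binomial test of k 'successes' in n trials against p = 0.5."""
    # With p = 0.5 the null distribution is symmetric, so double the smaller tail
    # and cap the result at 1.
    tail = min(k, n - k)
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2.0 * one_sided)

# Counts quoted in the text: intercepts favoring the charged amino acid in thermophiles.
print(two_sided_sign_test(24, 47))  # charged = R, D, E, K
print(two_sided_sign_test(28, 57))  # histidine also counted as charged
```

Both calls return 1.0, the largest possible P value, which matches the statement that no overall preference for charged amino acids is apparent in these intercepts.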

Supplementary Material

Supplementary tables 1 and 2 are available at Genome Biology and Evolution online (http://www.oxfordjournals.org/our_journals/gbe/).
References (10 of 67 shown)

1.  Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins.

Authors:  M M Gromiha; M Oobatake; A Sarai
Journal:  Biophys Chem       Date:  1999-11-15       Impact factor: 2.352

2.  Patterns of temperature adaptation in proteins from Methanococcus and Bacillus.

Authors:  J H McDonald; A M Grasso; L K Rejto
Journal:  Mol Biol Evol       Date:  1999-12       Impact factor: 16.240

3.  The stability of thermophilic proteins: a study based on comprehensive genome comparison.

Authors:  R Das; M Gerstein
Journal:  Funct Integr Genomics       Date:  2000-05       Impact factor: 3.410

4.  Effective factors in thermostability of thermophilic proteins.

Authors:  M Sadeghi; H Naderi-Manesh; M Zarrabi; B Ranjbar
Journal:  Biophys Chem       Date:  2005-10-25       Impact factor: 2.352

5.  Physics and evolution of thermophilic adaptation.

Authors:  Igor N Berezovsky; Eugene I Shakhnovich
Journal:  Proc Natl Acad Sci U S A       Date:  2005-08-24       Impact factor: 11.205

6.  Inferring parameters shaping amino acid usage in prokaryotic genomes via Bayesian MCMC methods.

Authors:  Hugo Naya; Daniel Gianola; Héctor Romero; Jorge I Urioste; Héctor Musto
Journal:  Mol Biol Evol       Date:  2005-09-14       Impact factor: 16.240

7.  A complete sequence of the T. tengcongensis genome.

Authors:  Qiyu Bao; Yuqing Tian; Wei Li; Zuyuan Xu; Zhenyu Xuan; Songnian Hu; Wei Dong; Jian Yang; Yanjiong Chen; Yanfen Xue; Yi Xu; Xiaoqin Lai; Li Huang; Xiuzhu Dong; Yanhe Ma; Lunjiang Ling; Huarong Tan; Runsheng Chen; Jian Wang; Jun Yu; Huanming Yang
Journal:  Genome Res       Date:  2002-05       Impact factor: 9.043

8.  Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1.

Authors:  O White; J A Eisen; J F Heidelberg; E K Hickey; J D Peterson; R J Dodson; D H Haft; M L Gwinn; W C Nelson; D L Richardson; K S Moffat; H Qin; L Jiang; W Pamphile; M Crosby; M Shen; J J Vamathevan; P Lam; L McDonald; T Utterback; C Zalewski; K S Makarova; L Aravind; M J Daly; K W Minton; R D Fleischmann; K A Ketchum; K E Nelson; S Salzberg; H O Smith; J C Venter; C M Fraser
Journal:  Science       Date:  1999-11-19       Impact factor: 47.728

9.  Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis.

Authors:  Hiroshi Akashi; Takashi Gojobori
Journal:  Proc Natl Acad Sci U S A       Date:  2002-03-19       Impact factor: 11.205

10.  The genome sequence of Methanosphaera stadtmanae reveals why this human intestinal archaeon is restricted to methanol and H2 for methane formation and ATP synthesis.

Authors:  Wolfgang F Fricke; Henning Seedorf; Anke Henne; Markus Krüer; Heiko Liesegang; Reiner Hedderich; Gerhard Gottschalk; Rudolf K Thauer
Journal:  J Bacteriol       Date:  2006-01       Impact factor: 3.490
