Abstract
Motivation: The relative rates of amino acid interchanges over evolutionary time are likely to vary among proteins. Variation in those rates has the potential to reveal information about constraints on proteins. However, the most straightforward model that could be used to estimate relative rates of amino acid substitution is parameter-rich and it is therefore impractical to use for this purpose.
Year: 2018 PMID: 29950007 PMCID: PMC6022633 DOI: 10.1093/bioinformatics/bty261
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Empirical models of protein sequence evolution
| Model | Training data | References |
|---|---|---|
| General models: | | |
| JTT | — | |
| LG | — | |
| PAM (Dayhoff) | — | |
| PMB | — | |
| VT | — | |
| WAG | — | |
| Specialized models: | | |
| HIVb | HIV (eight proteins) | |
| rtREV | retroelement | |
Note: ‘—’ indicates that many protein MSAs were used for training. Many different methods were used to estimate matrix parameters. Only a selected subset of specialized models is shown; many specialized models were trained using viral data (e.g. FLU) or organelle-encoded proteins (e.g. mtREV24 and cpREV).
Fig. 1. Dividing amino acid interchanges into radical and conservative is difficult. Amino acids can be divided into many different groups; radical changes are those between groups, whereas conservative changes are within groups. Dayhoff groups reflect patterns in their PAM matrix and their physicochemical properties. Hanada groups maximized the correlation between KR/KC and KA/KS for mammalian proteins. Many studies (e.g. Weber ) calculate KR/KC using a simple polar-nonpolar and/or large-small categorization. However, changes in many amino acid properties (i.e. any interchanges that cross lines in the diagram) can be radical, at least in some contexts. In fact, certain amino acids (C and P, shaded) have unique properties, and any substitution involving them might be radical. Thus, radical versus conservative changes should be viewed as a matter of degree rather than as absolutes.
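As a minimal sketch of the simple polar-nonpolar categorization mentioned in the caption, the Python snippet below labels an interchange as radical when the two residues fall in different groups. The group memberships and the names `group_index` and `is_radical` are illustrative assumptions, not the groupings drawn in Fig. 1; studies differ in how they assign borderline residues such as C, G and Y.

```python
# Illustrative two-group partition (an assumption; not taken from Fig. 1).
NONPOLAR = frozenset("AVLIMFWPG")
POLAR = frozenset("STCYNQDEKRH")
GROUPS = (NONPOLAR, POLAR)


def group_index(residue, groups=GROUPS):
    """Return the index of the group that contains a one-letter residue code."""
    for i, members in enumerate(groups):
        if residue in members:
            return i
    raise ValueError(f"unknown residue: {residue!r}")


def is_radical(a, b, groups=GROUPS):
    """True if the interchange a<->b crosses a group boundary."""
    return group_index(a, groups) != group_index(b, groups)


print(is_radical("L", "I"))  # both nonpolar here: conservative
print(is_radical("L", "D"))  # nonpolar vs polar here: radical
```

Passing a different `groups` tuple changes which interchanges count as radical, which is the ambiguity the caption describes.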
Parameter estimates for single-parameter eq3 models
| Dataset | Sites | V | P | C | A | G | G + T |
|---|---|---|---|---|---|---|---|
| Yeasts: | | | | | | | |
| Flc2p | 494 | 4.86 | 4.87 | 3.33 | 2.73 | 3.25 | 3.05/0.65 |
| Ptc1p | 217 | 4.43 | 5.72 | 2.50 | 2.85 | 3.61 | 3.53/0.27 |
| Rfc2p | 296 | 4.69 | 4.79 | 3.22 | 3.71 | 3.07 | 2.90/0.54 |
| Ung1p | 185 | 3.71 | 4.39 | 1.99 | 3.22 | 2.08 | 1.89/0.48 |
| Tkl1p | 629 | 4.69 | 4.33 | 2.64 | 3.08 | 2.50 | 2.42/0.26 |
| Mean | | 4.48 | 4.82 | 2.74 | 3.12 | 2.90 | 2.76/0.44 |
| Birds: | | | | | | | |
| APC (54) | 2862 | 3.51 | 4.88 | 2.57 | 3.76 | 7.14 | 6.56/1.03 |
| GFPT1 (15) | 700 | 2.60 | 4.21 | 3.43 | 2.07 | 3.88 | 3.70/0.42 |
| HMBS (76) | 353 | 3.54 | 4.16 | 3.12 | 1.53 | 5.52 | 5.25/1.09 |
| IFGN1 (78) | 845 | 3.34 | 4.75 | 3.12 | 1.53 | 7.07 | 7.21/1.06 |
| PCNX (79) | 2359 | 3.31 | 4.31 | 2.52 | 3.01 | 5.33 | 4.89/0.96 |
| Mean | | 3.26 | 4.46 | 3.05 | 2.55 | 5.79 | 5.52/0.91 |
| Vertebrates: | | | | | | | |
| AQR | 1020 | 4.07 | 4.43 | 2.94 | 3.64 | 4.72 | 4.59/0.73 |
| COX10 | 490 | 3.13 | 4.06 | 2.57 | 4.15 | 4.70 | 4.54/0.69 |
| EDC4 | 761 | 4.22 | 5.43 | 2.80 | 4.02 | 4.52 | 4.43/0.58 |
| GPATCH1 | 515 | 3.52 | 4.74 | 3.01 | 3.92 | 4.12 | 3.97/0.67 |
| VPS54 | 347 | 3.94 | 4.91 | 2.40 | 4.37 | 4.52 | 4.24/0.77 |
| Mean | | 3.78 | 4.71 | 2.74 | 4.02 | 4.52 | 4.36/0.69 |
Parameter estimates are rounded to the nearest 0.01. Estimates of the T parameter were only obtained in combination with G; those parameter estimates are listed in the order G/T. Bird gene numbers are from the study by Jarvis . Complete output of the parameter optimization program is available in Supplementary File S3.
Parameter estimates and ΔlnL/site for the best-fitting eq3 models
| Dataset | V | P | C | A | G | T | ΔlnL/site | Best EMP |
|---|---|---|---|---|---|---|---|---|
| Yeasts: | | | | | | | | |
| Flc2p | 2.13 | 3.24 | — | 1.97 | 1.59 | 0.63 | 0.9621 | LG (−0.2062) |
| Ptc1p | 1.91 | 4.08 | — | 1.47 | 2.48 | — | 0.8091 | LG (−0.1132) |
| Rfc2p | 2.27 | 3.63 | — | 2.55 | 1.50 | 0.56 | 0.8790 | LG (−0.1692) |
| Ung1p | 1.85 | 3.51 | — | 2.66 | 0.72 | 0.35 | 0.7615 | rtREV (−0.1217) |
| Tkl1p | 2.65 | 2.91 | — | 1.71 | 1.08 | 0.24 | 0.6869 | LG (−0.2438) |
| Mean | 2.16 | 3.48 | 0.00 | 2.07 | 1.47 | 0.35 | | |
| Birds: | | | | | | | | |
| APC | — | 2.66 | 0.55 | 2.72 | 5.95 | 0.89 | 0.7518 | HIVb (−0.0002) |
| GFPT1 | — | 2.91 | — | — | 3.36 | — | 0.0862 | JTT (−0.0183) |
| HMBS | 1.71 | 2.00 | — | — | 4.71 | 0.93 | 0.7319 | JTT (−0.0006) |
| IFGN1 | 0.55 | 2.40 | 0.55 | 1.53 | 6.46 | 0.83 | 2.9576 | HIVb (−0.0667) |
| PCNX | 0.91 | 2.08 | 0.66 | 1.78 | 4.14 | 0.86 | 0.2512 | HIVb (−0.0039) |
| Mean | 0.63 | 2.41 | 0.35 | 1.21 | 4.92 | 0.70 | | |
| Vertebrates: | | | | | | | | |
| AQR | 1.33 | 2.56 | — | 2.54 | 3.67 | 0.57 | 0.8500 | JTT (−0.1015) |
| COX10 | — | 2.48 | — | 3.04 | 3.73 | 0.53 | 1.9200 | JTT (−0.1689) |
| EDC4 | 0.85 | 3.32 | — | 3.13 | 3.36 | 0.57 | 1.8728 | JTT (−0.1061) |
| GPATCH1 | 0.76 | 2.55 | 0.75 | 2.45 | 3.12 | 0.48 | 2.2052 | JTT (−0.2815) |
| VPS54 | 0.88 | 3.15 | −0.89 | 3.13 | 3.46 | 0.72 | 1.2291 | JTT (−0.0804) |
| Mean | 0.76 | 2.81 | −0.03 | 2.86 | 3.47 | 0.57 | | |
Note: ‘—’ indicates parameters that were not in the best-fitting eq3 model (based on the AIC). Any parameters absent from the best-fitting model were assumed to be zero when the mean was calculated. Parameter estimates are rounded to the nearest 0.01. ΔlnL is the likelihood difference per site (ΔlnL/site) relative to the F81-like model. ΔlnL/site is rounded to the nearest 0.0001. The best-fitting empirical model (‘Best EMP’) is followed by the ΔlnL/site relative to the best-fitting eq3 model. Complete output of the parameter optimization program is available in Supplementary File S3.
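The convention for the Mean rows (absent parameters count as zero) can be sketched as follows. The helper name `mean_with_absent_as_zero` is hypothetical, and `None` is used here as a stand-in for ‘—’; the input values are the V column of the bird datasets from the table above.

```python
def mean_with_absent_as_zero(values):
    """Average a column, counting None (parameter not in the model) as 0."""
    return sum(0.0 if v is None else v for v in values) / len(values)


# V column for the bird datasets (APC, GFPT1, HMBS, IFGN1, PCNX); None marks '—'.
bird_v = [None, None, 1.71, 0.55, 0.91]
print(round(mean_with_absent_as_zero(bird_v), 2))  # 0.63, as in the Mean row
```

Dividing by the number of datasets rather than by the number of non-missing estimates is what pulls the mean towards zero for parameters that were rarely selected.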
Parameter estimates for eq4 models using the best-fitting empirical model
| Dataset | V | P | C | A | G | T | Best EMP (ΔlnL/site) |
|---|---|---|---|---|---|---|---|
| Yeasts: | | | | | | | |
| Flc2p | — | — | — | — | 0.60 | — | LG (0.0098) |
| Ptc1p | — | 1.03 | — | — | 1.00 | −0.44 | LG (0.0550) |
| Rfc2p | 1.34 | — | — | — | — | — | LG (0.0164) |
| Ung1p | — | 1.06 | −1.06 | 1.13 | — | — | rtREV (0.0464) |
| Tkl1p | 0.92 | — | — | — | — | — | LG (0.0086) |
| Mean | 0.45 | 0.42 | −0.21 | 0.23 | 0.32 | −0.09 | |
| Birds: | | | | | | | |
| APC | −0.26 | 0.97 | −0.63 | 1.38 | 1.55 | 0.25 | HIVb (0.0251) |
| GFPT1 | — | — | — | — | — | — | JTT (—) |
| HMBS | — | — | — | — | 2.57 | 0.56 | JTT (0.0834) |
| IFGN1 | 0.35 | 0.40 | — | −0.28 | 2.76 | — | HIVb (0.0381) |
| PCNX | — | 0.45 | — | 0.80 | — | — | HIVb (0.0027) |
| Mean | 0.02 | 0.36 | −0.13 | 0.38 | 1.18 | 0.16 | |
| Vertebrates: | | | | | | | |
| AQR | 0.56 | — | — | 0.85 | 1.47 | — | JTT (0.0452) |
| COX10 | −0.56 | — | — | 1.11 | 1.56 | — | JTT (0.0961) |
| EDC4 | — | 1.01 | — | 1.71 | 1.11 | 0.20 | JTT (0.1488) |
| GPATCH1 | — | — | 0.63 | 0.52 | 0.92 | — | JTT (0.0605) |
| VPS54 | — | 0.86 | −1.22 | 1.54 | 1.16 | 0.30 | JTT (0.0858) |
| Mean | 0.00 | 0.38 | −0.12 | 1.15 | 1.25 | 0.10 | |
Note: ‘—’ indicates parameters that were not in the best-fitting (based on the AIC) eq4 model. In all cases, the best-fitting empirical model was used as the ‘base model’ that provided the K values in eq3. Any parameters not present in the best-fitting model were assumed to be zero for calculating the mean. Parameter estimates are rounded to the nearest 0.01. The best-fitting empirical model is followed by the ΔlnL per site relative to that model (‘—’ indicates the empirical model was not improved using eq4). Complete output of the parameter optimization program is available in Supplementary File S3.
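To illustrate the two quantities used in these notes, the sketch below selects a model by the AIC (using the standard definition AIC = 2k − 2 lnL, where k is the number of free parameters) and computes ΔlnL per site. The candidate model names, log-likelihoods, parameter counts and site count are invented for illustration; this is not the parameter optimization program referred to in the note.

```python
def aic(lnl, k):
    """Akaike information criterion for log-likelihood lnl and k free parameters."""
    return 2 * k - 2 * lnl


def best_by_aic(candidates):
    """Return the name of the candidate with the lowest AIC.

    candidates maps a model name to a (lnL, number of free parameters) pair.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


def delta_lnl_per_site(lnl_model, lnl_reference, n_sites):
    """Log-likelihood improvement over a reference model, per alignment site."""
    return (lnl_model - lnl_reference) / n_sites


# Hypothetical fits: a base empirical model with 0, 1 or 2 extra parameters.
candidates = {
    "base": (-5000.0, 0),
    "base+G": (-4990.0, 1),
    "base+G+T": (-4989.5, 2),
}
best = best_by_aic(candidates)
print(best)
print(delta_lnl_per_site(candidates[best][0], candidates["base"][0], n_sites=500))
```

In this made-up case the second extra parameter raises lnL by only 0.5, which does not offset the AIC penalty of 2 per parameter, so the one-parameter extension is retained.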
Parameter estimates for concatenated datasets
| Dataset | Sites | V | P | C | A | G | T |
|---|---|---|---|---|---|---|---|
| Eq3 models: | | | | | | | |
| Chaperonins | 3970 | 2.68 | 3.87 | −0.20 | 1.86 | 0.74 | 0.19 |
| Clathrin | 2138 | 2.11 | 3.62 | −0.25 | 3.06 | 0.68 | 0.39 |
| DNA polymerase | 1782 | 2.19 | 2.99 | 0.53 | 2.15 | 1.03 | 0.16 |
| DNA replication | 2284 | 2.36 | 3.10 | 0.47 | 2.08 | 0.98 | 0.15 |
| Proteasome | 2474 | 2.43 | 3.18 | 0.21 | 2.44 | 0.76 | 0.18 |
| Ribosomal proteins | 11 586 | 2.23 | 2.92 | 0.29 | 2.28 | 0.62 | 0.10 |
| RNA polymerase | 3274 | 2.26 | 2.97 | 0.29 | 1.96 | 0.86 | 0.27 |
| Translation factors | 2045 | 2.13 | 3.21 | 0.32 | 2.36 | 0.80 | 0.21 |
| Mean | | 2.30 | 3.23 | 0.21 | 2.27 | 0.81 | 0.21 |
| Eq4 models: | | | | | | | |
| Chaperonins | 3970 | 1.10 | 0.88 | −0.55 | — | −0.44 | −0.18 |
| Clathrin | 2138 | 0.49 | 0.74 | −0.65 | 1.11 | −0.54 | — |
| DNA polymerase | 1782 | 0.68 | — | — | 0.24 | −0.24 | −0.21 |
| DNA replication | 2284 | 0.89 | — | — | — | −0.21 | −0.21 |
| Proteasome | 2474 | 0.84 | 0.27 | — | 0.51 | −0.47 | −0.19 |
| Ribosomal proteins | 11 586 | 0.73 | −0.23 | — | 0.45 | −0.34 | −0.55 |
| RNA polymerase | 3274 | 0.54 | — | — | 0.25 | −0.31 | −0.11 |
| Translation factors | 2045 | 0.44 | — | — | 0.42 | −0.33 | −0.20 |
| Mean | | 0.71 | 0.21 | −0.15 | 0.37 | −0.38 | −0.18 |
All parameter estimates reflect the Ecdysozoa tree. ‘—’ indicates parameters that were not in the best-fitting eq4 model (based on the AIC). Any parameters not present in the best-fitting model were assumed to be zero for calculating the mean. Parameter estimates are rounded to the nearest 0.01. The empirical model used for the eq4 models was always LG. ‘DNA replication’ refers to DNA replication licensing factors, i.e. the MCM family. Complete output of the parameter optimization program is available in Supplementary File S4.