| Literature DB >> 31792239 |
Francesca Rizzato, Stefano Zamuner, Andrea Pagnani, Alessandro Laio.
Abstract
We introduce a simple model that describes the average occurrence of point variations in a generic protein sequence. This model is based on the idea that mutations are more likely to be fixed at sites in contact with others that have mutated in the recent past. We therefore extend the usual assumptions made in protein coevolution by introducing a time damping on the effect of a substitution on its surroundings, which makes correlated substitutions happen in avalanches localized in space and time. The model correctly predicts the average correlation of substitutions as a function of their distance along the sequence. At the same time, it predicts an among-site distribution of the number of substitutions per site that is highly compatible with a negative binomial, consistent with experimental data. The promising outcomes achieved with this model encourage the application of the same ideas in the field of pairwise and multiple sequence alignment.
Year: 2019 PMID: 31792239 PMCID: PMC6888882 DOI: 10.1038/s41598-019-53958-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Conditional probability P(d) of observing a mutation d sites away from another mutation, at sequence identities s of 60–62%, 70–72%, 80–82% and 90–92%, respectively. Model (J = 0.02 and r0 = 0.0004) in orange, data in purple.
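As an illustration of the quantity plotted in Figure 1, the following is a minimal sketch of one way P(d) could be estimated from a single pair of aligned sequences. The function names (`mutation_mask`, `conditional_mutation_probability`) and the estimator itself are assumptions made here for illustration, not the procedure used in the paper; in particular, gaps are treated as ordinary characters and only the forward direction along the sequence is counted.

```python
import math


def mutation_mask(seq_a, seq_b):
    """Return a list of booleans: True where two aligned sequences differ.

    Assumption: both sequences are already aligned and have equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [a != b for a, b in zip(seq_a, seq_b)]


def conditional_mutation_probability(mask, d):
    """Estimate P(site i+d mutated | site i mutated) for a separation d >= 1.

    Only positions i for which i+d lies inside the sequence are considered.
    Returns NaN when no mutated site i is available as a reference.
    """
    if d < 1:
        raise ValueError("d must be a positive integer")
    n = len(mask) - d
    anchors = sum(1 for i in range(n) if mask[i])
    if anchors == 0:
        return math.nan
    both = sum(1 for i in range(n) if mask[i] and mask[i + d])
    return both / anchors


# Toy usage on two short, made-up sequences (hypothetical data).
mask = mutation_mask("AAAAAAAA", "ABBAAABA")
p_1 = conditional_mutation_probability(mask, 1)
p_4 = conditional_mutation_probability(mask, 4)
```

In practice such an estimate would be averaged over many sequence pairs falling in the same identity window (for example 60–62%) before being compared with a model curve.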
Figure 2. Weighted fit of the normalized histogram of the number of substitutions k per site to a negative binomial distribution at various sequence identities. The best-fit value of α is displayed in the key. The rms residuals of these fits are, from top-left to bottom-right, 1.9, 2.1, 0.33 and 1.3.
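To make the role of α in Figure 2 concrete, here is a minimal sketch of a negative binomial with shape α and mean μ, together with a method-of-moments estimate of α. Both the parametrization (variance μ + μ²/α) and the moment estimator are assumptions made for illustration: the paper's Eq. 2 is not reproduced in this record, and the figure reports a weighted fit rather than a moment estimate.

```python
import math


def neg_binomial_pmf(k, alpha, mu):
    """Negative binomial probability of k substitutions at a site.

    Assumed parametrization: shape alpha > 0, mean mu > 0,
    variance mu + mu**2 / alpha. Computed in log space for stability.
    """
    log_p = (
        math.lgamma(k + alpha) - math.lgamma(k + 1) - math.lgamma(alpha)
        + alpha * math.log(alpha / (alpha + mu))
        + k * math.log(mu / (alpha + mu))
    )
    return math.exp(log_p)


def alpha_from_moments(counts):
    """Method-of-moments estimate of alpha from per-site substitution counts.

    Solves var = mean + mean**2 / alpha. Returns infinity when the counts
    are not overdispersed (variance <= mean), i.e. the Poisson-like limit.
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    if var <= mean:
        return math.inf
    return mean ** 2 / (var - mean)


# Toy usage on made-up per-site counts (hypothetical data).
counts = [0, 0, 0, 0, 1, 1, 2, 6]
alpha_hat = alpha_from_moments(counts)
total = sum(neg_binomial_pmf(k, 2.0, 1.5) for k in range(200))
```

A small α corresponds to strong among-site rate heterogeneity (a few sites collect most substitutions), while large α approaches a Poisson distribution.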
Figure 3. Panel (a): α estimated by the model described by Eq. 1 and by the two-class model, as a function of sequence identity. The value of α is estimated by fitting the number of substitutions per site to a negative binomial (Eq. 2). Panel (b): estimate of α for five Pfam families obtained with FastTree-2[49] on subtrees characterized by different average sequence identities (procedure described in Materials and Methods).
Figure 4. Panel (a): Substitution avalanches on Influenza Hemagglutinin from 1980 to 2015[44]. Panel (b): Avalanches of simulated substitutions (Eq. 1 with J = 0.02 and r0 = 0.0004) on an example structure (PDB 16pk, chain H). In both panels each cross or square represents one substitution that took place at the site given by the x value and in the year given by the y value. Gray crosses stand for either isolated substitutions or avalanches made of two substitutions. The squares mark the remaining substitutions and are colored according to the avalanche to which they belong, following the procedure described in section Avalanches detection and data from influenza hemagglutinin. The colored regions highlight some of the avalanches and are only guides for the eye. Note that the same avalanche can be split into two or more regions along the sequence, since a contact can be present even between sites that are not close along the sequence.