Alberto Pessia, Yonatan Grad, Sarah Cobey, Juha Santeri Puranen, Jukka Corander.
Abstract
The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel in a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin (HA) sequences of influenza A/H3N2. Antigenic and 3D structural mapping of the results shows that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.
Keywords: data clustering; protein evolution; sequence analysis
Year: 2015; PMID: 28348810; PMCID: PMC5320600; DOI: 10.1099/mgen.0.000025
Source DB: PubMed; Journal: Microb Genom; ISSN: 2057-5858
Fig. 1. Temporal distribution of influenza A/H3N2 HA within each K-Pax2 cluster. Groups are sorted by sampling year of the earliest consensus sequence.
Fig. 2. Maximum-likelihood phylogenetic tree of influenza A/H3N2 HA. K-Pax2 clusters are shown in the tree in different colours. The scale bar indicates the expected number of substitutions per site.
Fig. 3. Phylogeny of influenza A/H3N2 HA as a phylogeny of K-Pax2 clusters. The ancestor of each cluster is defined as the group at minimum average genetic distance among those at least 1 year older. Each cluster is labelled by its earliest consensus sequence. The highlighted clusters, which connect the viruses observed in 1968 to the most recent ones, are the ‘core’ clusters. The scale bar indicates the expected number of substitutions per site.
Fig. 4. Maximum-likelihood phylogenetic tree of influenza A/H3N2 HA, restricted to core cluster consensus sequences. The 23 strains are the core clusters’ earliest consensus sequences. The scale bar indicates the expected number of substitutions per site.
Fig. 5. HA1 chain characteristic sites and their changes across the 23 core clusters. Vertical grey bars indicate positions where the previous characteristic amino acid has not mutated to a new value. White indicates that no amino acid at that position was determined to be characteristic. All other colours correspond to specific amino acids. The abscissa indicates residue position along the HA protein.
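The ancestor rule stated in the Fig. 3 caption (the closest group by average genetic distance among groups at least 1 year older) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `assign_ancestors`, the cluster labels, the distance values, and the use of a single year per cluster (taken here to be the sampling year of its earliest consensus sequence) are all hypothetical.

```python
from typing import Dict, FrozenSet, Optional


def assign_ancestors(
    years: Dict[str, int],
    avg_dist: Dict[FrozenSet[str], float],
    min_gap: int = 1,
) -> Dict[str, Optional[str]]:
    """Pick, for each cluster, the closest cluster at least `min_gap` years older.

    `years` maps a cluster label to one representative year (assumption).
    `avg_dist` maps an unordered pair of labels to an average genetic distance.
    """
    ancestors: Dict[str, Optional[str]] = {}
    for cluster, year in years.items():
        # Candidate ancestors: every cluster that is old enough.
        candidates = [o for o, y in years.items() if year - y >= min_gap]
        if not candidates:
            ancestors[cluster] = None  # oldest cluster(s) have no ancestor
            continue
        ancestors[cluster] = min(
            candidates, key=lambda o: avg_dist[frozenset((cluster, o))]
        )
    return ancestors


# Toy, made-up input (not data from the paper).
years = {"c1968": 1968, "c1972": 1972, "c1976": 1976}
avg_dist = {
    frozenset(("c1968", "c1972")): 0.02,
    frozenset(("c1968", "c1976")): 0.05,
    frozenset(("c1972", "c1976")): 0.03,
}
ancestors = assign_ancestors(years, avg_dist)
print(ancestors)
```

Keying the distances by `frozenset` keeps the lookup symmetric, so the order of the two cluster labels does not matter.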
Unadjusted mutation rate estimates, as observed on the HA1 of influenza A/H3N2, by B cell epitope (BCE)
Rates have been estimated as $m/(l \cdot t)$, where $m$ is the total number of amino acid changes, $l$ is the length of the region and $t$ is the time difference in years between two clusters. Independence between sites and homogeneous rates per region are assumed.
| Year | A | B | C | D | E | Not BCE | HA1 global |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1972 | 0.0263 | 0.0227 | 0.0093 | 0.0122 | 0.0114 | 0.0025 | 0.0076 |
| 1976 | 0.0395 | 0.0341 | 0.0185 | 0.0305 | 0.0227 | 0.0013 | 0.0121 |
| 1977 | 0.1053 | 0.1364 | 0.0741 | 0.122 | 0.0909 | 0.005 | 0.0455 |
| 1979 | 0.0526 | 0.0682 | 0 | 0.0244 | 0 | 0 | 0.0106 |
| 1983 | 0.0132 | 0.0114 | 0.0093 | 0.0183 | 0 | 0.0013 | 0.0053 |
| 1988 | 0.0105 | 0.0273 | 0 | 0 | 0.0091 | 0 | 0.003 |
| 1989 | 0.0526 | 0 | 0.037 | 0 | 0.0909 | 0 | 0.0121 |
| 1992 | 0.0175 | 0.0758 | 0.0123 | 0.0081 | 0.0152 | 0 | 0.0091 |
| 1993 | 0.0526 | 0 | 0 | 0.0244 | 0 | 0 | 0.0061 |
| 1995 | 0.0263 | 0.0227 | 0.037 | 0.0366 | 0 | 0 | 0.0106 |
| 1996 | 0.3684 | 0.1818 | 0.0741 | 0.0732 | 0.1364 | 0 | 0.0576 |
| 2001 | 0 | 0.0091 | 0.0074 | 0 | 0.0091 | 0.003 | 0.0036 |
| 2002 | 0.0526 | 0.0909 | 0 | 0 | 0.0455 | 0.005 | 0.0152 |
| 2003 | 0 | 0.0909 | 0 | 0.0244 | 0 | 0 | 0.0091 |
| 2004 | 0.0526 | 0 | 0 | 0.0244 | 0 | 0 | 0.0061 |
| 2005 | 0 | 0.0455 | 0.037 | 0 | 0 | 0.005 | 0.0091 |
| 2006 | 0.0526 | 0 | 0 | 0 | 0 | 0 | 0.003 |
| 2007 | 0 | 0 | 0 | 0.0244 | 0 | 0 | 0.003 |
| 2009 | 0 | 0.0455 | 0.0185 | 0.0122 | 0 | 0 | 0.0061 |
| 2010 | 0 | 0.0455 | 0.1111 | 0 | 0 | 0.0101 | 0.0182 |
| 2012(a)* | 0.0263 | 0 | 0 | 0 | 0 | 0 | 0.0015 |
| 2012(b)* | 0.0526 | 0.0227 | 0 | 0 | 0 | 0 | 0.0045 |
| Global† | 0.0311 | 0.0351 | 0.0152 | 0.0161 | 0.0145 | 0.0014 | 0.0092 |
* Mutations since 2010.
† It is unknown which of the two 2012 co-circulating groups will go extinct. The global rate has been computed by arbitrarily choosing cluster 2012(a).
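The rate estimate described in the table note (number of amino acid changes divided by region length times elapsed years) can be checked with a small worked example. The function name `mutation_rate` is hypothetical, and the inputs are illustrative assumptions: the text does not state the epitope lengths or the change counts, so 2 changes over a 19-site region in 4 years is just one combination consistent with a tabulated value.

```python
def mutation_rate(changes: int, length: int, years: float) -> float:
    """Unadjusted rate: amino acid changes per site per year.

    Assumes, as the table note does, independent sites and a homogeneous
    rate within the region.
    """
    if length <= 0 or years <= 0:
        raise ValueError("length and years must be positive")
    return changes / (length * years)


# Illustrative inputs (assumed, not stated in the text):
# 2 changes in a 19-site region over 4 years.
example_rate = mutation_rate(changes=2, length=19, years=4)
print(round(example_rate, 4))
```

With these assumed inputs the result rounds to 0.0263, which is the value tabulated for epitope A in the 1972 row.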
Fig. 6. Core clusters in antigenic space. Polygon shapes and sizes depend on the availability of inhibition assay data.