
Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package.

Abstract

MOTIVATION: In recent years, there has been increasing interest in the potential of codon substitution models for a variety of applications. However, the computational demands of these models have sometimes led to the adoption of oversimplified assumptions, questionable statistical methods or a limited focus on small data sets.
RESULTS: Here, we offer a scalable, message-passing-interface-based Bayesian implementation of site-heterogeneous codon models in the mutation-selection framework. Our software jointly infers the global mutational parameters at the nucleotide level, the branch lengths of the tree and a Dirichlet process governing across-site variation at the amino acid level. We focus on an example estimation of the distribution of selection coefficients from an alignment of several hundred sequences of the influenza PB2 gene, and highlight the site-specific characterization enabled by such a modeling approach. Finally, we discuss future potential applications of the software for conducting evolutionary inferences.
AVAILABILITY AND IMPLEMENTATION: The models are implemented within the PhyloBayes-MPI package (available at phylobayes.org), with usage details in the accompanying manual.

Substances:
Codon

Year: 2013 PMID: 24351710 PMCID: PMC3967107 DOI: 10.1093/bioinformatics/btt729

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

There is growing interest in the use of codon substitution models in several contexts, including phylogenetic inference (Gil et al.), ancestral sequence reconstruction (Chang) and the characterization of the selective effects of specific nonsynonymous mutations (Tamuri et al.). Focusing on the last of these, Halpern and Bruno (1998) first showed how to devise a model that accounts for both global mutational features at the nucleotide level and site-specific selective constraints at the amino acid level. Although their approach was directed to the estimation of evolutionary distances, it was later recognized as enabling the estimation of distributions of selection coefficients from phylogenetic data (see Thorne for a review of these developments). However, a serious issue with the Halpern and Bruno model, and with some of the subsequent re-implementations (e.g. Tamuri et al.), lies in the use of site-specific parameters optimized to maximum likelihood estimates; such an approach induces the ‘infinitely many parameters trap’, in which each additional observation changes the form of the overall model (see Rodrigue, 2013). Yang and Nielsen (2008) devised some simpler homogeneous mutation-selection models, along with a likelihood ratio test aimed at evaluating the significance of codon usage bias. While statistically well justified, the homogeneity of their mutation-selection models makes them biologically unsatisfying for the purpose of estimating distributions of selection coefficients. Subsequently, we proposed the use of a nonparametric approach based on the Dirichlet process, providing a flexible and statistically well-founded method for accommodating across-site heterogeneity of amino acid constraints (Rodrigue et al.). However, our proof-of-concept implementation could only be applied to very small data sets, and its rate-limiting Markov chain Monte Carlo (MCMC) updates on the Dirichlet process, based on a Chinese-restaurant approach, were not amenable to parallelization.
Working with nucleotide and amino acid level substitution models, we recently developed PhyloBayes-MPI, which, among other speedup strategies, uses the message-passing interface (MPI) and a truncated stick-breaking representation of the Dirichlet process for parallelized updating (Lartillot et al.). Here, we have extended PhyloBayes-MPI to implement several types of codon substitution models, including the Dirichlet process-based site-heterogeneous mutation-selection approach. We illustrate how the software can now be used for efficient estimation of distributions of selection coefficients (scaled by the effective chromosomal population size), and discuss several future avenues that it enables.

2 METHODS

In the present application, the program is passed an alignment file of coding nucleotide sequences (of a length that is a multiple of 3) and a corresponding tree topology file. The universal genetic code is assumed, but an alternative code can be specified (e.g. -mtvert for the vertebrate mitochondrial code). As with other models in PhyloBayes-MPI, the program uses K > 1 cores. At startup, the master core draws an initial model configuration from the prior and broadcasts it using MPI to the K − 1 compute cores. All updates are based on data augmentation, which makes them several orders of magnitude faster than pruning-based updates (de Koning et al.). Each iteration of the MCMC includes numerous updates of global parameters, performed by the master core, whereas the compute cores perform several updates of a truncated stick-breaking representation of the Dirichlet process (see Lartillot et al. for details) and sample the data augmentations.
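
As a rough illustration of why a truncated stick-breaking representation lends itself to parallel updating, the sketch below draws truncated mixture weights and then allocates sites to mixture components independently of one another. It is a minimal sketch in Python and not PhyloBayes-MPI code: the function names, the truncation level, the concentration parameter alpha and the toy per-component likelihoods are all assumptions made for illustration.

```python
import random


def truncated_stick_breaking(alpha, n_components, rng):
    """Draw mixture weights from a stick-breaking prior truncated at n_components."""
    weights = []
    remaining = 1.0  # portion of the unit stick not yet assigned
    for k in range(n_components):
        # The last break takes everything that is left, so the weights sum to 1.
        v = 1.0 if k == n_components - 1 else rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights


def allocate_site(weights, site_likelihoods, rng):
    """Sample a component for one site, proportional to weight * likelihood."""
    scores = [w * lik for w, lik in zip(weights, site_likelihoods)]
    return rng.choices(range(len(weights)), weights=scores, k=1)[0]


rng = random.Random(1)
weights = truncated_stick_breaking(alpha=1.0, n_components=50, rng=rng)

# Hypothetical likelihoods of 6 sites under each of the 50 components.
toy_likelihoods = [[rng.random() for _ in weights] for _ in range(6)]

# Given the weights, each site is allocated without reference to the others,
# so this loop could be split across compute cores.
allocations = [allocate_site(weights, lik, rng) for lik in toy_likelihoods]
```

With a fixed, finite set of weights, the allocation of one site does not depend on the allocation of any other. A Chinese-restaurant update, by contrast, conditions each site on the current allocations of all the others.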

3 EXAMPLE

We present an example that we could not run with our previous implementation, consisting of 401 sequences of the PB2 gene of influenza, taken from Tamuri et al. With the present implementation, we obtained a sample of 1100 draws within ∼5 h, running on a 12-core (hyper-threaded), Intel i7-based workstation. Discarding the first 100 draws as burn-in, we display the posterior mean distributions of selection coefficients (S) in Figure 1.
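
One simple way to form a posterior mean distribution from MCMC output is to compute a histogram of S for each retained draw and average these histograms bin by bin. The Python sketch below assumes that each draw has already been summarized as a list of bin proportions. The function name and the toy values are hypothetical, and nothing here reflects the output format of PhyloBayes-MPI.

```python
def posterior_mean_histogram(draws, burn_in):
    """Average per-draw histograms bin by bin after dropping the first burn_in draws."""
    kept = draws[burn_in:]
    if not kept:
        raise ValueError("no draws left after burn-in")
    n_bins = len(kept[0])
    return [sum(draw[b] for draw in kept) / len(kept) for b in range(n_bins)]


# Toy input: 4 draws, each a 3-bin histogram of S (hypothetical values).
draws = [
    [0.9, 0.1, 0.0],  # early draw, discarded as burn-in
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.4, 0.3, 0.3],
]
mean_hist = posterior_mean_histogram(draws, burn_in=1)
```

For the run described above, the analogous call would use burn_in=100 on the 1100 draws, leaving 1000 draws in the average.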

Fig. 1. Posterior mean distribution of selection coefficients (S) for all types of events (a) and for nonsynonymous events (b) appearing at mutation-selection balance, with red histograms for mutations (mainly deleterious, hence negative values), and green histograms for substitutions (symmetrical about 0, given the mutation-selection balance). Panel (c) summarizes the first 100 site-specific distributions of nonsynonymous mutations.

Looking at panel a, we find that the highest-valued bin is that of synonymous events (S = 0), most mutations (red) are deleterious (S < 0) and most substitutions (green) are either neutral (synonymous) or nearly neutral. Whereas the ‘infinitely many parameters’ approach used by Tamuri et al. led to the conclusion that most nonsynonymous mutations have , panel b shows that most have a selection coefficient between −10 and −2, with the mode situated at −5. The red distribution, however, seems more plausible than that obtained using a parametric site-specific approach (Rodrigue, 2013), which inferred most nonsynonymous events with S between 0 and −5. Panel c displays a site-specific assessment of the distributions of S for nonsynonymous mutations, focusing on the first 100 codons of the alignment. For graphical simplicity, we have added up the values over a few sets of classes. The class, in blue, is the most represented, as previously revealed from panel b, but some sites, e.g. codon 75, have almost as many nonsynonymous mutations in the nearly neutral class, , indicating that they are under less stringent evolutionary constraints than other sites.
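
The contrast between the mutation and substitution histograms can be reproduced on a toy example. The Python sketch below assumes the fixation factor S / (1 − e^(−S)) that is standard in Halpern and Bruno style mutation-selection models; the text above does not spell this formula out. It also assumes a symmetric mutation process, so that stationary frequencies are proportional to exp(F_i), where F_i is the scaled log-fitness of state i. The three-state fitness vector and all function names are hypothetical.

```python
import math


def fixation_factor(s):
    """Relative fixation rate S / (1 - exp(-S)); equals 1 for a neutral change."""
    if abs(s) < 1e-8:
        return 1.0
    return s / (1.0 - math.exp(-s))


def fluxes_at_balance(log_fitness, mutation_rate=1.0):
    """Return {(i, j): (S, flux)} for mutations and for substitutions."""
    # Assumption: symmetric mutation, so frequencies are proportional to exp(F_i).
    total = sum(math.exp(f) for f in log_fitness)
    freqs = [math.exp(f) / total for f in log_fitness]
    mutations, substitutions = {}, {}
    for i, f_i in enumerate(log_fitness):
        for j, f_j in enumerate(log_fitness):
            if i == j:
                continue
            s = f_j - f_i  # scaled selection coefficient of the change i -> j
            proposed = freqs[i] * mutation_rate
            mutations[(i, j)] = (s, proposed)
            substitutions[(i, j)] = (s, proposed * fixation_factor(s))
    return mutations, substitutions


mutations, substitutions = fluxes_at_balance([0.0, 1.0, -2.0])

# Mass of proposed mutations that are deleterious versus advantageous.
deleterious = sum(flux for s, flux in mutations.values() if s < 0)
advantageous = sum(flux for s, flux in mutations.values() if s > 0)
```

In this toy setting most of the mutation flux falls at S < 0, whereas the substitution flux from i to j equals that from j to i, with opposite signs of S. That equality is what makes the substitution histogram symmetrical about 0 at mutation-selection balance.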

4 FUTURE DIRECTIONS

Our implementation enables numerous potential applications. For instance, the last panel of Figure 1 suggests studies on the distributions of S over classes of sites. Distributions could even be generated for all possible types of nonsynonymous mutations at each site. The software could also serve in the development of approaches that explicitly incorporate structural features (e.g. Meyer and Wilke, 2013), and already includes mutation-selection models using finite mixtures, homogeneous versions and models based on a univariate factor on nonsynonymous rates (ω). Applications of these models will be the focus of future papers. Developed within the PhyloBayes-MPI package, the models we have implemented here inherit the ability to perform a variety of types of posterior predictive model assessments, cross-validation comparisons, ancestral sequence reconstruction, as well as phylogenetic inference per se. Much more work is needed in these contexts, to assess what insights may be gained from the mutation-selection framework, and from codon substitution models in general. The present application should help engage such work.

9 in total

1. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage.

Authors: Ziheng Yang; Rasmus Nielsen
Journal: Mol Biol Evol Date: 2008-01-03 Impact factor: 16.240

2. Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories.

Authors: A P Jason de Koning; Wanjun Gu; David D Pollock
Journal: Mol Biol Evol Date: 2009-09-25 Impact factor: 16.240

3. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies.

Authors: A L Halpern; W J Bruno
Journal: Mol Biol Evol Date: 1998-07 Impact factor: 16.240

4. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles.

Authors: Nicolas Rodrigue; Hervé Philippe; Nicolas Lartillot
Journal: Proc Natl Acad Sci U S A Date: 2010-02-22 Impact factor: 11.205

5. On the statistical interpretation of site-specific variables in phylogeny-based substitution models.

Authors: Nicolas Rodrigue
Journal: Genetics Date: 2012-12-05 Impact factor: 4.562

6. PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment.

Authors: Nicolas Lartillot; Nicolas Rodrigue; Daniel Stubbs; Jacques Richer
Journal: Syst Biol Date: 2013-04-05 Impact factor: 15.683

7. Integrating sequence variation and protein structure to identify sites under selection.

Authors: Austin G Meyer; Claus O Wilke
Journal: Mol Biol Evol Date: 2012-09-12 Impact factor: 16.240

8. CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models.

Authors: Manuel Gil; Marcelo Serrano Zanetti; Stefan Zoller; Maria Anisimova
Journal: Mol Biol Evol Date: 2013-02-23 Impact factor: 16.240

9. Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models.

Authors: Asif U Tamuri; Mario dos Reis; Richard A Goldstein
Journal: Genetics Date: 2011-12-29 Impact factor: 4.562
