
Influence of multiple-sequence-alignment depth on Potts statistical models of protein covariation.

Allan Haldane, Ronald M Levy.

Abstract

Potts statistical models have become a popular and promising way to analyze mutational covariation in protein multiple sequence alignments (MSAs) in order to understand protein structure, function, and fitness. But the statistical limitations of these models, which can have millions of parameters and are fit to MSAs of only thousands or hundreds of effective sequences using a procedure known as inverse Ising inference, are incompletely understood. In this work we predict how model quality degrades as a function of the number of sequences N, sequence length L, amino-acid alphabet size q, and the degree of conservation of the MSA, in different applications of the Potts models: in "fitness" predictions of individual protein sequences, in predictions of the effects of single-point mutations, in "double mutant cycle" predictions of epistasis, and in 3D contact prediction in protein structure. We show how, as MSA depth N decreases, an "overfitting" effect occurs such that sequences in the training MSA have overestimated fitness, and we predict the magnitude of this effect and discuss how regularization can help correct for it, using a regularization procedure motivated by statistical analysis of the effects of finite sampling. We find that as N decreases the quality of point-mutation effect predictions degrades least, fitness and epistasis predictions degrade more rapidly, and contact predictions are most affected. However, overfitting becomes negligible for MSA depths of more than a few thousand effective sequences, as often used in practice, and regularization becomes less necessary. We discuss the implications of these results for users of Potts covariation analysis.
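
To make the "millions of parameters" remark and the "double mutant cycle" quantity concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes the standard pairwise Potts form E(S) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), with a sign convention that may differ from the authors', and the function names, the array layout, and the example values L = 200 and q = 21 are all illustrative assumptions.

```python
import numpy as np


def potts_param_count(L, q):
    """Count fields plus couplings of a pairwise Potts model (no gauge fixing)."""
    n_fields = L * q
    n_couplings = (L * (L - 1) // 2) * q * q
    return n_fields + n_couplings


def potts_energy(seq, h, J):
    """Assumed form: E(S) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j].

    seq: integer array of length L with values in [0, q)
    h:   array of shape (L, q)
    J:   array of shape (L, L, q, q); only entries with i < j are read
    """
    L = len(seq)
    energy = 0.0
    for i in range(L):
        energy -= h[i, seq[i]]
        for j in range(i + 1, L):
            energy -= J[i, j, seq[i], seq[j]]
    return energy


def double_mutant_cycle(seq, i, a, j, b, h, J):
    """Epistasis estimate: E(double) - E(single at i) - E(single at j) + E(wild type)."""
    wild_type = np.array(seq)
    single_i = wild_type.copy()
    single_i[i] = a
    single_j = wild_type.copy()
    single_j[j] = b
    double = single_i.copy()
    double[j] = b
    return (potts_energy(double, h, J) - potts_energy(single_i, h, J)
            - potts_energy(single_j, h, J) + potts_energy(wild_type, h, J))


# Illustrative sizes only (hypothetical, not from the abstract): L = 200, q = 21
print(potts_param_count(200, 21))
```

Under this assumed form, the field terms and all couplings to other positions cancel in the double mutant cycle, so the result depends only on the four entries of J[i, j] involving the wild-type and mutant residues at positions i and j. The parameter count grows as L^2 q^2, which is why an MSA of a few hundred effective sequences can be small relative to the model.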


Year:  2019        PMID: 30999494      PMCID: PMC6508952          DOI: 10.1103/PhysRevE.99.032405

Source DB:  PubMed          Journal:  Phys Rev E        ISSN: 2470-0045            Impact factor:   2.529


  8 in total

1.  Epistasis and entrenchment of drug resistance in HIV-1 subtype B.

Authors:  Avik Biswas; Allan Haldane; Eddy Arnold; Ronald M Levy
Journal:  Elife       Date:  2019-10-08       Impact factor: 8.140

2.  Unique features of different classes of G-protein-coupled receptors revealed from sequence coevolutionary and structural analysis.

Authors:  Hung N Do; Allan Haldane; Ronald M Levy; Yinglong Miao
Journal:  Proteins       Date:  2021-10-09

3.  Mi3-GPU: MCMC-based Inverse Ising Inference on GPUs for protein covariation analysis.

Authors:  Allan Haldane; Ronald M Levy
Journal:  Comput Phys Commun       Date:  2020-04-17       Impact factor: 4.390

4.  GEMME: a simple and fast global epistatic model predicting mutational effects.

Authors:  Elodie Laine; Yasaman Karami; Alessandra Carbone
Journal:  Mol Biol Evol       Date:  2019-08-12       Impact factor: 16.240

5.  Remote homology search with hidden Potts models.

Authors:  Grey W Wilburn; Sean R Eddy
Journal:  PLoS Comput Biol       Date:  2020-11-30       Impact factor: 4.475

6.  Modeling Sequence-Space Exploration and Emergence of Epistatic Signals in Protein Evolution.

Authors:  Matteo Bisardi; Juan Rodriguez-Rivas; Francesco Zamponi; Martin Weigt
Journal:  Mol Biol Evol       Date:  2022-01-07       Impact factor: 16.240

7.  Limits to detecting epistasis in the fitness landscape of HIV.

Authors:  Avik Biswas; Allan Haldane; Ronald M Levy
Journal:  PLoS One       Date:  2022-01-18       Impact factor: 3.240

8.  The generative capacity of probabilistic protein sequence models.

Authors:  Francisco McGee; Sandro Hauri; Quentin Novinger; Slobodan Vucetic; Ronald M Levy; Vincenzo Carnevale; Allan Haldane
Journal:  Nat Commun       Date:  2021-11-02       Impact factor: 14.919

