
CCMpred--fast and precise prediction of protein residue-residue contacts from correlated mutations.

Stefan Seemayer, Markus Gruber, Johannes Söding

Abstract

MOTIVATION: Recent breakthroughs in protein residue-residue contact prediction have made reliable de novo prediction of protein structures possible. The key was to apply statistical methods that can distinguish direct couplings between pairs of columns in a multiple sequence alignment from merely correlated pairs, i.e. to separate direct from indirect effects. Two classes of such methods exist, relying either on regularized inversion of the covariance matrix or on pseudo-likelihood maximization (PLM). Although PLM-based methods offer clearly higher precision, available tools are not sufficiently optimized and are written in interpreted languages that introduce additional overheads. This increases runtimes and impedes large-scale contact prediction for larger protein families, multi-domain proteins and protein-protein interactions.
RESULTS: Here we introduce CCMpred, our performance-optimized PLM implementation in C and CUDA C. Using graphics cards in the price range of current six-core processors, CCMpred can predict contacts for typical alignments 35-113 times faster and with the same precision as the most accurate published methods. For users without a CUDA-capable graphics card, CCMpred can also run in a CPU mode that is still 4-14 times faster. Thanks to our speed-ups, contacts for typical protein families can be predicted in 15-60 s on a consumer-grade GPU and 1-6 min on a six-core CPU.
AVAILABILITY AND IMPLEMENTATION: CCMpred is free and open-source software under the GNU Affero General Public License v3 (or later) available at https://bitbucket.org/soedinglab/ccmpred.
© The Author 2014. Published by Oxford University Press.

Year:  2014        PMID: 25064567      PMCID: PMC4201158          DOI: 10.1093/bioinformatics/btu500

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Evolutionary pressure to maintain a stable protein structure gives rise to correlated mutations between contacting residue pairs. These correlated mutations can be observed in a multiple sequence alignment (MSA) of the protein family and can be used to predict residue–residue contacts. A recent breakthrough was achieved by applying methods from statistics and statistical physics that aim at disentangling direct couplings from mere correlations between MSA columns (Ekeberg et al.; Kamisetty et al.; Marks et al.; Weigt et al.). This has resulted in a boost in contact prediction accuracy, thanks to which it is now possible to reliably predict protein structures using only sequence information if enough homologous sequences are available (Hopf et al.; Marks et al.; Nugent and Jones, 2012). Currently, 30% of Pfam families meet a reasonable criterion of having three homologous sequences per residue in the chain (see Supplementary Section 7 for details). Modern contact prediction methods differ by their strategy in the disentangling step: the most accurate class of methods, such as plmDCA (Ekeberg et al.) and GREMLIN (Kamisetty et al.), learns the direct couplings as parameters of a Markov random field by maximizing its pseudo-likelihood, which has runtime complexity of O(NL^2), where N is the number of homologous sequences in the MSA and L its number of columns. The less accurate methods based on sparse covariance matrix inversion, such as PSICOV (Jones et al.) or Mean Field Direct Coupling Analysis, use the sequence information only in a preprocessing step, while the main computation in O(L^3) is independent of N. This makes them fast for short alignments (small L) but slow for long alignments (large L). Whereas most protein families used for benchmarking so far have been relatively short, in practice, longer alignments are more relevant, for example, to predict interdomain or even interprotein contacts (Ovchinnikov et al.).
Still, existing methods would take ∼29 CPU years to complete large-scale studies such as the computation of contact predictions for the 30% of Pfam with sufficient sequence coverage (see Supplementary Section 7).
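
To make the O(NL^2) cost of pseudo-likelihood maximization concrete, the objective can be written in the form commonly used in the Potts-model literature. The notation below (single-column potentials v_i, pairwise potentials w_ij, regularization weights \lambda_v and \lambda_w) is an assumption chosen for illustration and may differ from the notation used in the Supplementary Information.

```latex
% Conditional probability of residue a in column i, given the rest of sequence x
P(x_i = a \mid x_{\setminus i}) =
  \frac{\exp\bigl( v_i(a) + \sum_{j \neq i} w_{ij}(a, x_j) \bigr)}
       {\sum_{b} \exp\bigl( v_i(b) + \sum_{j \neq i} w_{ij}(b, x_j) \bigr)}

% L2-regularized log pseudo-likelihood over N sequences and L columns
\ell(v, w) =
  \sum_{n=1}^{N} \sum_{i=1}^{L}
    \log P\bigl( x_i^{(n)} \mid x_{\setminus i}^{(n)} \bigr)
  - \lambda_v \sum_{i=1}^{L} \lVert v_i \rVert_2^2
  - \lambda_w \sum_{i < j} \lVert w_{ij} \rVert_F^2
```

Each of the N L conditional terms contains a sum over the other L - 1 columns, so a single evaluation of the objective touches on the order of N L^2 coupling entries (times a factor that depends on the alphabet size), which is where the quadratic dependence on L comes from.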

2 RESULTS

CCMpred implements the approach taken in plmDCA and GREMLIN, which is based on maximizing the pseudo-likelihood of an L2-regularized Markov random field (see Supplementary Information for details). After successful optimization, the couplings are ranked by the Frobenius norms of the pairwise potentials, and the average product correction (Dunn et al.) is applied to compute the final score. As explained in the Supplementary Information, computing the gradient of the pseudo-likelihood is an almost ideal use case for GPUs, as the computations can be run efficiently in parallel on the thousands of GPU processors with little idling due to memory access limitations. We compare the runtimes and precisions of CCMpred with two other pseudo-likelihood maximization (PLM)-based tools, plmDCA (plmDCA-symmetric 3, plmDCA-asymmetric 1) and GREMLIN (version 2.01), and with the covariance matrix inversion-based tool PSICOV. The recently published FreeContact software (Kaján et al.) is much faster than PSICOV but clearly less accurate than plmDCA and GREMLIN and was not included here.
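
The scoring step described above (Frobenius norm of each pairwise potential, followed by the average product correction) can be sketched in plain Python. This is an illustrative sketch and not CCMpred's own code: the function names, the nested-list layout of the couplings and the choice to exclude the diagonal when averaging are all assumptions.

```python
import math


def frobenius_scores(couplings):
    """Return an L x L matrix of Frobenius norms of the pairwise potentials.

    couplings[i][j] is assumed to be a q x q nested list holding the
    potentials between columns i and j (hypothetical data layout).
    """
    length = len(couplings)
    scores = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            if i != j:
                squared = sum(w * w for row in couplings[i][j] for w in row)
                scores[i][j] = math.sqrt(squared)
    return scores


def average_product_correction(scores):
    """Subtract (row mean * column mean / overall mean) from every score.

    Means are taken over off-diagonal entries only (an assumption; some
    implementations include the diagonal). Requires at least two columns.
    """
    length = len(scores)
    row_mean = [
        sum(scores[i][j] for j in range(length) if j != i) / (length - 1)
        for i in range(length)
    ]
    overall_mean = sum(row_mean) / length
    corrected = [[0.0] * length for _ in range(length)]
    if overall_mean == 0.0:
        # All raw scores are zero, so there is nothing to correct.
        return corrected
    for i in range(length):
        for j in range(length):
            if i != j:
                background = row_mean[i] * row_mean[j] / overall_mean
                corrected[i][j] = scores[i][j] - background
    return corrected


# Toy example with three columns and made-up raw scores.
raw = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
final = average_product_correction(raw)
```

The correction removes the part of each score that is explained by both columns having generally high scores, so pairs are ranked by how much they stand out from their own background.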

2.1 Precision

For benchmarking the precision of contact prediction methods, we use the same set of 150 Pfam families with sequences and high-resolution structures (1.9 Å) with identical input alignments as used for the PSICOV method (Jones et al.). We rank the list of predicted contacts and determine the fraction of physical contacts (C distance Å) when selecting increasing numbers of contacts. Figure 1 shows that CCMpred is among the top tools.
Fig. 1. Precision of contact prediction for increasing numbers of predicted pairs, normalized by length L of the target
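
The evaluation described in this section (rank the predicted pairs, then measure the fraction of true contacts among the top predictions, with the number of predictions expressed relative to L) can be sketched as follows. The function names, the toy distance matrix, the 8.0 Å cutoff and the minimum sequence separation of 2 are assumptions for illustration only; they are not taken from the benchmark itself.

```python
def contacts_from_distances(distances, threshold, min_separation=1):
    """Return the set of pairs (i, j), i < j, closer than `threshold`.

    `distances` is an L x L nested list of inter-residue distances;
    `min_separation` skips pairs that are close in sequence.
    """
    length = len(distances)
    return {
        (i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
        if distances[i][j] < threshold
    }


def precision_curve(ranked_pairs, true_contacts, length, fractions=(0.5, 1.0)):
    """Precision among the top (fraction * L) predicted pairs.

    `ranked_pairs` must already be sorted by decreasing score.
    """
    curve = {}
    for fraction in fractions:
        top_k = max(1, int(fraction * length))
        selected = ranked_pairs[:top_k]
        hits = sum(
            1 for i, j in selected if (min(i, j), max(i, j)) in true_contacts
        )
        curve[fraction] = hits / len(selected) if selected else 0.0
    return curve


# Toy protein with four residues and made-up distances (in Å).
toy_distances = [
    [0.0, 3.8, 6.0, 12.0],
    [3.8, 0.0, 3.8, 7.5],
    [6.0, 3.8, 0.0, 3.8],
    [12.0, 7.5, 3.8, 0.0],
]
true_contacts = contacts_from_distances(toy_distances, threshold=8.0, min_separation=2)
ranked = [(1, 3), (0, 3), (0, 2)]  # hypothetical predictions, best first
curve = precision_curve(ranked, true_contacts, length=4, fractions=(0.5, 0.75))
```

Expressing the number of selected pairs as a fraction of L makes curves for proteins of different lengths comparable, which is what the normalization in Fig. 1 achieves.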

2.2 Runtimes

For runtime benchmarks, we generated synthetic MSAs with 3000 sequences and 50, 100, … , 1000 columns (real alignments show similar speedups but exhibit more variance in their runtimes; see Supplementary Fig. S4 for details). Because GPUs and CPUs differ in their numbers of cores, frequency per core, etc., we attempt to make a fair comparison by comparing runtimes for hardware of similar price. We ran the GPU version of CCMpred on an NVIDIA GeForce GTX 780 Ti and all CPU-based methods on an Intel Xeon E5-2620 six-core processor. Alignments with L > 500 were run on a Tesla K40 GPU with 12 GB RAM (gray points). Figure 2 shows the runtime of the methods for increasing alignment length. PSICOV is the fastest CPU method for small L, as its runtime is independent of sequence count. However, for , CCMpred becomes faster than PSICOV for alignments with typical numbers of sequences (). At typical alignment lengths of L = 300, the CCMpred GPU code is 35 times faster than plmDCA, 113 times faster than GREMLIN and 16 times faster than PSICOV. On the same data, our CPU version is 4.3 times faster than plmDCA, 14 times faster than GREMLIN and 2.0 times faster than PSICOV, and 8.3 times slower than our GPU code. For plmDCA and our CPU version, we use all six cores. PSICOV and GREMLIN do not support multi-threading and, therefore, ran on a single core. However, even if implementations with perfect scaling existed (dividing runtimes by six), GREMLIN would still not be as fast as the CPU version of CCMpred. Our GPU code would be faster than a parallelized PSICOV at L > 150, and our CPU code would be faster at L > 600.
Fig. 2. Total runtimes on MSAs with 3000 sequences.
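
The perfect-scaling argument for the single-threaded tools can be worked through with the speed-up factors reported at L = 300. The sketch below only rearranges numbers stated in the text; the variable and function names are hypothetical, and dividing by six models an idealized parallel implementation that does not exist.

```python
# Speed-up factors of CCMpred at L = 300, as reported in the text.
GPU_SPEEDUP = {"plmDCA": 35.0, "GREMLIN": 113.0, "PSICOV": 16.0}
CPU_SPEEDUP = {"plmDCA": 4.3, "GREMLIN": 14.0, "PSICOV": 2.0}

# Tools that ran on a single core in the benchmark.
SINGLE_CORE_TOOLS = ("GREMLIN", "PSICOV")
CORES = 6


def speedup_vs_ideal_parallel(speedup, cores=CORES):
    """Remaining CCMpred speed-up if the other tool scaled perfectly.

    A value above 1 means CCMpred would still be faster; a value below 1
    means the idealized parallel tool would overtake it.
    """
    return speedup / cores


def summarize():
    """Collect the adjusted GPU and CPU speed-ups for single-core tools."""
    return {
        tool: {
            "gpu": speedup_vs_ideal_parallel(GPU_SPEEDUP[tool]),
            "cpu": speedup_vs_ideal_parallel(CPU_SPEEDUP[tool]),
        }
        for tool in SINGLE_CORE_TOOLS
    }


rows = summarize()
for tool, values in rows.items():
    print(f"{tool}: GPU x{values['gpu']:.2f}, CPU x{values['cpu']:.2f}")
```

Under these assumptions, a six-core GREMLIN would still be about 2.3 times slower than the CCMpred CPU version, whereas a six-core PSICOV would be about three times faster than the CPU version at L = 300 and still slower than the GPU version. This is in line with the stated crossover points of L > 600 for the CPU code and L > 150 for the GPU code.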

3 CONCLUSION

CCMpred is a fast GPU and CPU implementation of a top-performing PLM-based contact prediction approach that runs in a fraction of the time of comparably accurate methods. The speed increase is particularly important for long proteins and large-scale applications. Because CCMpred is free and open-source software, we hope that it also can serve as a basis for further methods development in this field.
REFERENCES

1.  PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.

Authors:  David T Jones; Daniel W A Buchan; Domenico Cozzetto; Massimiliano Pontil
Journal:  Bioinformatics       Date:  2011-11-17       Impact factor: 6.937

2.  Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis.

Authors:  Timothy Nugent; David T Jones
Journal:  Proc Natl Acad Sci U S A       Date:  2012-05-29       Impact factor: 11.205

3.  Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction.

Authors:  S D Dunn; L M Wahl; G B Gloor
Journal:  Bioinformatics       Date:  2007-12-05       Impact factor: 6.937

4.  Identification of direct residue contacts in protein-protein interaction by message passing.

Authors:  Martin Weigt; Robert A White; Hendrik Szurmant; James A Hoch; Terence Hwa
Journal:  Proc Natl Acad Sci U S A       Date:  2008-12-30       Impact factor: 11.205

5.  Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era.

Authors:  Hetunandan Kamisetty; Sergey Ovchinnikov; David Baker
Journal:  Proc Natl Acad Sci U S A       Date:  2013-09-05       Impact factor: 11.205

6.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models.

Authors:  Magnus Ekeberg; Cecilia Lövkvist; Yueheng Lan; Martin Weigt; Erik Aurell
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2013-01-11

7.  Three-dimensional structures of membrane proteins from genomic sequencing.

Authors:  Thomas A Hopf; Lucy J Colwell; Robert Sheridan; Burkhard Rost; Chris Sander; Debora S Marks
Journal:  Cell       Date:  2012-05-10       Impact factor: 41.582

8.  Protein 3D structure computed from evolutionary sequence variation.

Authors:  Debora S Marks; Lucy J Colwell; Robert Sheridan; Thomas A Hopf; Andrea Pagnani; Riccardo Zecchina; Chris Sander
Journal:  PLoS One       Date:  2011-12-07       Impact factor: 3.240

9.  Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information.

Authors:  Sergey Ovchinnikov; Hetunandan Kamisetty; David Baker
Journal:  Elife       Date:  2014-05-01       Impact factor: 8.140

10.  FreeContact: fast and free software for protein contact prediction from residue co-evolution.

Authors:  László Kaján; Thomas A Hopf; Matúš Kalaš; Debora S Marks; Burkhard Rost
Journal:  BMC Bioinformatics       Date:  2014-03-26       Impact factor: 3.169
