Liquid-theory analogy of direct-coupling analysis of multiple-sequence alignment and its implications for protein structure prediction.

Abstract

The direct-coupling analysis is a powerful method for protein contact prediction, and enables us to extract "direct" correlations between distant sites that are latent in "indirect" correlations observed in a protein multiple-sequence alignment. I show that the direct correlation can be obtained by using a formulation analogous to the Ornstein-Zernike integral equation in liquid theory. This formulation intuitively illustrates how the indirect or apparent correlation arises from an infinite series of direct correlations, and provides interesting insights into protein structure prediction.

Keywords: Ornstein-Zernike equation; molecular evolution; protein design; sequence profile

Year: 2015 PMID: 27493860 PMCID: PMC4736835 DOI: 10.2142/biophysico.12.0_117

Source DB: PubMed Journal: Biophys Physicobiol ISSN: 2189-4779

Protein multiple-sequence alignments (MSA) are a useful means of extracting diverse and valuable information about protein families [1]. It is well recognized that the frequency of amino acid residues at each alignment site is a useful measure of its functional importance. It has also been suggested that correlation between distant sites along the sequence is a rich source of information about the structure and function of protein families [2]. In fact, recent years have seen a significant advance in our understanding of the site-site correlation observed in MSA. Of particular importance is the development of direct-coupling analysis (DCA) and related methods [3-5]. Although the basic idea had already been suggested in the last century [6], it is only through the recent explosion of protein sequence data, together with theoretical developments, that a practical implementation of the idea became possible. What DCA tells us is clear: the “apparent” correlation observed in an MSA is a result of “direct” correlations, which are closely related to structural contacts. For example, if residues i and j are in physical contact (directly correlated), and so are residues j and k, then residues i and k may appear to be correlated even if they are not in contact. There are many variants of DCA today. A major one is based on the principle of maximum entropy [3]; others are based on the graphical Gaussian model [4] or phylogenetic analysis [5]. All of these methods are good predictors of physical contacts between residues in native protein structures.

In this Note, I derive the direct correlation based on a formulation that is analogous to the integral equation theory of simple liquids [7]. This formulation has the advantage that it intuitively shows how apparent correlations arise from an infinite series of direct correlations. Based on the analogy with liquid theory, it may be possible to elaborate the theory of direct correlations in MSA. More importantly, the intuitive picture that the present analysis provides helps us examine the mechanism of protein structure prediction from a new perspective, which may in turn lead to the development of new methods based on novel principles.

Theory

A multiple-sequence alignment consisting of $M$ ($\gg 1$) amino acid sequences and $N$ alignment sites may be regarded as an $M \times N$ matrix of symbols. That is, each row represents an amino acid sequence including gap symbols, and each column represents an alignment site. Let $n_{k,i}(a) = 1$ if residue type $a$ appears at site $i$ of sequence $k$, and $n_{k,i}(a) = 0$ otherwise. We first define the frequency $n_i(a)$ of residue $a$ at site $i$ as

$$n_i(a) = \frac{1}{M} \sum_{k=1}^{M} n_{k,i}(a). \tag{1}$$

Next, the correlation (covariance) between residue $a$ at site $i$ and residue $b$ at site $j$ is defined as

$$C_{ij}(a,b) = \frac{1}{M} \sum_{k=1}^{M} \left[ n_{k,i}(a) - n_i(a) \right] \left[ n_{k,j}(b) - n_j(b) \right]. \tag{2}$$

For simplicity, we assume that there are enough sequences for these statistics to be computed accurately, and we ignore the effect of phylogenetic bias in a family of sequences. A further caveat concerns completely conserved sites, for which the corresponding rows and columns of the correlation matrix are zero. We assume this problem is properly taken care of, for example, by adding pseudo-counts.

The correlations as a whole can be regarded as a $21N \times 21N$ matrix $C$ by properly ordering residues and sites. Note that, since the equality

$$\sum_{a} n_{k,i}(a) = 1 \tag{3}$$

holds for any sequence $k$ (the sum running over the 20 residue types and the gap symbol), the matrix $C$ is rank-deficient. Nevertheless, it can be made invertible by removing the rows and columns corresponding to the gap symbol, so that the size of the matrix $C$ becomes $20N \times 20N$, which is assumed in the following.

Now we assume that there exists a “direct correlation” $D_{ij}(a,b)$ between residue $a$ at site $i$ and residue $b$ at site $j$, and that the correlation $C$ is the result of an infinite series of the direct correlations:

$$C_{ij}(a,b) = n_i(a)\,\delta_{ij}\delta_{ab} + n_i(a)\,D_{ij}(a,b)\,n_j(b) + \sum_{l,c} n_i(a)\,D_{il}(a,c)\,n_l(c)\,D_{lj}(c,b)\,n_j(b) + \cdots \tag{4}$$

By defining the diagonal matrix $\rho_{ij}(a,b) = n_i(a)\,\delta_{ij}\delta_{ab}$, this equation is expressed as

$$C = \rho + \rho D \rho + \rho D \rho D \rho + \cdots = \rho + \rho D C. \tag{5}$$

This matrix equation is analogous to the Ornstein-Zernike integral equation in the theory of simple liquids [7] and can be expressed as a diagram in Figure 1 (where the left-hand side represents $H = C - \rho$). By solving this equation for $D$, we have

$$D = \rho^{-1} - C^{-1}, \tag{6}$$

which is essentially equivalent to the result of the mean-field DCA derived by Morcos et al. [3] based on the Plefka expansion [8].

Figure 1. A diagrammatic representation of Eq. (4) with $H = C - \rho$.
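The recipe in this section can be sketched numerically. The following is a minimal illustration, not the author's implementation: it assumes the direct correlation is obtained as $D = \rho^{-1} - C^{-1}$ from the gap-free covariance matrix, and it uses a small random alphabet of `q` residue types plus one gap symbol instead of 20 + 1. The function name `direct_correlation`, the uniform-mixture pseudo-count with weight `alpha`, the toy alignment, and the Frobenius-norm pair score are all assumptions made for the example.

```python
import numpy as np

def direct_correlation(msa, q, alpha=0.5):
    """Return (C, rho, D) for an integer-coded alignment.

    msa   : (M, N) integer array; 0..q-1 are residue types, q is the gap symbol
    alpha : pseudo-count weight in (0, 1); an illustrative choice, not from the source
    """
    M, N = msa.shape
    s = q + 1                                  # number of symbols including the gap
    X = np.eye(s)[msa].reshape(M, N * s)       # indicators n_{k,i}(a), one row per sequence

    # Single-site and pair frequencies, mixed with a uniform distribution (pseudo-counts)
    f1 = (1 - alpha) * X.mean(axis=0) + alpha / s
    f2 = (1 - alpha) * (X.T @ X) / M + alpha / s**2
    for i in range(N):                         # same-site blocks are exactly diagonal
        block = slice(i * s, (i + 1) * s)
        f2[block, block] = np.diag(f1[block])

    C_full = f2 - np.outer(f1, f1)             # covariance over all symbols (rank-deficient)

    keep = np.arange(N * s) % s != q           # drop rows/columns of the gap symbol
    C = C_full[np.ix_(keep, keep)]
    rho = np.diag(f1[keep])                    # diagonal matrix of residue frequencies
    D = np.linalg.inv(rho) - np.linalg.inv(C)  # direct correlation
    return C, rho, D

# Hypothetical toy alignment: random symbols, with site 1 forced to copy site 0
rng = np.random.default_rng(0)
M, N, q = 500, 4, 3
msa = rng.integers(0, q + 1, size=(M, N))
msa[:, 1] = msa[:, 0]

C, rho, D = direct_correlation(msa, q)

# One possible site-pair score: Frobenius norm of each q x q block of D
scores = np.sqrt((D.reshape(N, q, N, q) ** 2).sum(axis=(1, 3)))
np.fill_diagonal(scores, 0.0)
print(np.round(scores, 2))
```

In this toy setting one would expect the pair (0, 1) to receive the highest score, since it is the only pair that co-varies by construction. The pseudo-count also keeps the gap-free covariance matrix invertible when a site is strongly conserved.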

Discussion

While Morcos et al. [3] used direct correlations as pair-wise interactions between residues, direct correlations (in liquid theory) are generally different from interactions. In fact, the approach of Morcos et al. may be interpreted as the mean-spherical approximation [7], which is a particular closure condition for solving the Ornstein-Zernike equation. It may be interesting to investigate other choices of closure conditions, such as those analogous to the Percus-Yevick (PY) or hypernetted-chain (HNC) approximations [7]. The HMSA closure [9] is another interesting possibility.

By rearranging Eq. (6), we have

$$\rho = \left( C^{-1} + D \right)^{-1}. \tag{7}$$

This relation can be interpreted as a self-consistent condition (rather than a “definition”) for $\rho$ when $D$ is given, and shows how the position-specificity of residue frequencies depends on the entire context of a protein family and its structure. It is now widely accepted that sequence-based profile methods [10,11] are the best methods for template-based structure prediction. Noting that the direct correlations correspond well to native contacts, Eq. (7) tells us that an infinite series of tertiary interactions is effectively convoluted into a sequence profile through the alignment of many evolutionarily related sequences. In contrast, purely structure-based profile or threading methods [12], intuitively speaking, take into account only the first one or two terms in Eq. (4), where $\rho$ in this case is position-independent. This may be a reason for the insufficient position-specificity, and hence the limited success, of purely structure-based profile methods.

The present analysis also has an implication for template-free or de novo structure prediction. All template-free methods are based on some empirical energy or scoring function (whether physicochemical or statistical) and suffer from the problem of a rugged energy landscape that leads to many suboptimal non-native structures. Meanwhile, studies on protein folding have shown that the energy landscape of natural proteins is minimally frustrated and funnel-like. This property can be readily modeled by Gō-like potentials, in which only the native contacts are stabilizing [13,14]. It is conjectured that natural proteins have been selected to satisfy this property in the course of molecular evolution [13]. This observation suggests a way to improve structure prediction by improving protein sequence design. That is, an empirical energy function that can reproduce the sequence profiles of (natural) protein families in the (re)designing process (i.e., generating sequences compatible with a given native structure) [15,16] may be expected to realize the “correct” direct correlation, and the development of such an energy function may help improve structure prediction.

Physicochemically, it is the sequence that determines the structure. Evolutionarily, however, it is the structure that molds the pattern of a family of sequences. DCA sheds new light especially on the latter aspect of proteins by explicitly providing the relation between the observed correlation $C$ (i.e., the pattern of sequences) and the direct correlation $D$ (≈ physical contacts). I hope the present analysis helps further clarify the meaning of this intricate relationship between protein sequences and structures.
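The remark about keeping only the first one or two terms of the series can be made concrete with a small numerical sketch. It assumes the series $C = \rho + \rho D \rho + \rho D \rho D \rho + \cdots$ and its closed form $C = (\rho^{-1} - D)^{-1}$, and uses a hypothetical four-site chain with a single residue type per site. The values 0.3 and 0.8 and the helper `partial_sum` are illustrative choices, picked so that the series converges.

```python
import numpy as np

# Hypothetical toy system: 4 sites in a chain, one residue type per site.
n_sites = 4
rho = 0.3 * np.eye(n_sites)                  # position-independent frequencies
D = np.zeros((n_sites, n_sites))
for i in range(n_sites - 1):                 # direct correlations only between chain neighbours
    D[i, i + 1] = D[i + 1, i] = 0.8

# Closed form of the infinite series: C = (rho^{-1} - D)^{-1}
C_exact = np.linalg.inv(np.linalg.inv(rho) - D)

def partial_sum(rho, D, n_terms):
    """Sum the first n_terms of rho + rho D rho + rho D rho D rho + ..."""
    term = rho.copy()
    total = rho.copy()
    for _ in range(n_terms - 1):
        term = term @ D @ rho                # append one more factor of D rho
        total = total + term
    return total

C_two_terms = partial_sum(rho, D, 2)         # truncated after the first two terms
C_many_terms = partial_sum(rho, D, 60)       # numerically converged for these values

# Recover rho from C and D via rho = (C^{-1} + D)^{-1}
rho_recovered = np.linalg.inv(np.linalg.inv(C_exact) + D)

print("sites 0 and 2, two terms :", C_two_terms[0, 2])
print("sites 0 and 2, full series:", C_exact[0, 2])
```

With only two terms, sites 0 and 2 (which share no direct correlation) appear uncorrelated. The full series assigns them a nonzero apparent correlation mediated by site 1, which is the kind of context dependence that a truncated expansion misses.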

9 in total

1. Native protein sequences are close to optimal for their structures.

Authors: B Kuhlman; D Baker
Journal: Proc Natl Acad Sci U S A Date: 2000-09-12 Impact factor: 11.205

Review 2. Theory of protein folding.

Authors: José Nelson Onuchic; Peter G Wolynes
Journal: Curr Opin Struct Biol Date: 2004-02 Impact factor: 6.809

3. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.

Authors: David T Jones; Daniel W A Buchan; Domenico Cozzetto; Massimiliano Pontil
Journal: Bioinformatics Date: 2011-11-17 Impact factor: 6.937

4. Direct-coupling analysis of residue coevolution captures native contacts across many protein families.

Review 5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

6. Hidden Markov models in computational biology. Applications to protein modeling.

Review 7. Theoretical studies of protein folding.

8. Prediction of contact residue pairs based on co-substitution between sites in protein structures.

9. Computational protein design quantifies structural constraints on amino acid covariation.

1 in total

1. Monte Carlo simulation of a statistical mechanical model of multiple protein sequence alignment.

Authors: Akira R Kinjo
Journal: Biophys Physicobiol Date: 2017-07-12