Literature DB >> 25810434

A statistical physics perspective on alignment-independent protein sequence comparison.

Amit K Chattopadhyay¹, Diar Nasiev¹, Darren R Flower².

Abstract

MOTIVATION: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-base similarity has performed poorly.
RESULTS: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from 'first passage probability distribution' to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach.

Entities: Chemical Disease Species

Mesh：

Year: 2015 PMID： 25810434 PMCID： PMC4514925 DOI： 10.1093/bioinformatics/btv167

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Determining the similarity between macromolecules is central to bioinformatics. While comparison of 3-dimensional macromolecular structures remains an active area, most work focuses on macromolecular sequences. From sequence similarity devolves much of our understanding of evolutionary homology and probable structural and functional relatedness, allowing sequences to be grouped in a meaningful way. It is the basis of inherited or inferred functional annotation, allowing us to deduce the broad function of proteins in newly sequenced genomes. Similarity is determined, almost exclusively, through the alignment of sequences as text. Textual sequence similarity is taken as a surrogate for common ancestry and, by extension, functional and structural similarity. Most approaches to protein sequence similarity use models of sequence evolution and compare amino-acid strings, searching for linear conservation of sequence. Sequence alignment, where equivalent or near-equivalent symbols are brought into register, has been investigated intensely for many decades (Altschul, 1991; Vinga and Almeida, 2003), and thus an enormous associated literature has accumulated. Typically, substitution matrices specify a score for aligning pairs of nucleotides or amino acids; in such matrices different amino acids or nucleotides score differently according to the potential likelihood that one will replace the other in a sequence. For amino acids, many matrices have been published, based on many rationales (Feng ; Taylor, 1986), including the genetic code and amino acid physico-chemical properties. Most commonly-used matrices are typically derived empirically from exhaustive comparison of known sequences or structures. The log-odds matrices (Schwartz and Dayhoff, 1978) derived from the PAM model of protein evolution (Dayhoff ) was, for many years, the most widely used. Statistical results indicate such matrices adopt an implicit ‘log-odds’ form, with a specific target distribution for aligned residue pairs. The sensitivity of protein sequence searching depends on the selection of appropriate substitution matrices (Henikoff and Henikoff, 1992; 1993; Pearson, 1995). BLOSUM, and other commonly-used matrices, constructed from particular sets of related proteins, are tailored to target frequencies reflecting implied standard background amino acid compositions. While probabilistic extensions to sequence alignment, such as profiles and Hidden Markov Models (HMMs), can capture position-specific variation in multiple alignments, and typically demonstrate significantly augmented sensitivity, all alignment methods remain prone to similar limitations. Compared to sequence alignment, alternative approaches, grouped together as alignment-free techniques, and first proposed by Blaisdell (1986), have not been investigated as thoroughly (Davies ). Extant methods fall into several groups; of these, perhaps the most explored approach focuses on sequence comparison based on the joint sub-word or k-tuple content of groups of sequences, and their analysis using increasingly sophisticated probabilistic statistics. Amongst other approaches, methods based on plotting so-called propensity scales (Nakai ) have enjoyed long-standing popularity; with scales mirroring one or more amino acid properties, such as hydrophobicity (Hopp and Woods, 1981) or electronegativity. Such scales abound: AAindex has collected 545 different published scales (Kawashima ). Others have summarized such data, producing, inter alia, three (Hellberg ) or five scales (Sandberg ; Venkatarajan and Braun, 2001). A propensity scale is a means to characterize numerically local sequence properties, usually plotting amino acid properties along a sequence. Plotting individual values is of little value however unless there is an obvious periodicity. More helpful is to average the values using a moving or sliding window: typically using a flat, symmetrical window of no more than 10–20 amino acids in length. A moving window can have several potential parameters: first, the scale chosen; second, the window length; third, the window shape; fourth, whether the window is symmetrical or unbalanced; and fifthly, how values are smoothed after averaging. Typically, windows are short, flat, and symmetrical about the central residue, which takes the averaged value. Several different window shapes have been suggested, corresponding to weighting each position independently. Independent of the exact parameters used to define the window function, various kinds of smoothing are available, including digital filtering, integral transforms (Fourier and Cosine), and wavelet analysis, each with their own characteristics. Smoothing seeks to reduce the random component of the initial value spectrum generated by windowing, with the high-frequency regions removed, leaving only dominant low-frequency modes. The value of individual propensity plots is limited. It works well, say, for predicting transmembrane regions within proteins, where peaks in the plot can correlate well with regions highly enriched in hydrophobic residues. Hitherto, it has proved difficult to interpret such plots other than by a peak-spotting. Several decades ago, most predictors were based on identifying maximally valued regions of sequences; essentially looking for peaks, or troughs, in some form of a propensity plot. Epitopes, such as immunological T- or B-cell epitopes (Deavin ; Hopp and Woods, 1981); loops and surface exposure (Dovidchenko ); and transmembrane helices (Sipos and von Heijne, 1993) were—and often still are—predicted this way. Propensity scales have also been used in QSAR studies, particularly those focusing on peptides (Hellberg ; Sandberg ). In what follows, we take a decisive step away from such analyses, using techniques drawn from statistical physics. Specifically, we use techniques from the rich literature of first passage probability distribution, sometimes referred to as the ‘persistence’ problem (Bush and Chattopadhyay, 2014; Derrida ; Majumdar, 1996; Bray, 2013; Redner, 2007); applying them for the first time to sequence analysis. Persistence analysis has found applications in many fields including, inter alia: stock market analysis (Ren, 2005), modeling immunological systems (Chattopadhyay and Burroughs, 2007), extremal value statistics pertaining to data degradation (Whitmore, 1986), modeling the population biology of HIV dynamics (Tuckwell and Wan, 2000), event detection time for mobile sensor networks (Inaltekin ), and supply chain optimization (Wakuta, 2000). In this article, we develop and analyze propensity data by modeling sequences as a time-series, estimating the scaling regime of a generalized probability density function (PDF) of a variable derived from the original propensity data structure. Our protocol enables us to abstract key features from propensity plots while remaining free of any text-based alignment scheme. We then apply this alignment-independent approach to the analysis of protein sequences, evaluating it as a potential means of automatically characterizing and clustering large numbers of sequences.

2 Persistence Analysis: Deriving Order Parameters from Protein Sequences

The idea underlying persistence analysis is simple. It relies on the nature of the PDF of stochastic time series data X(t): in our case, ensemble-averaged propensity scale data. The basic question asked is: what is the probability p(t) that the field X(t) has not changed sign up to time t, starting from an initial configuration X(t0)? An equivalent question would be: what is the probability p(t1,t2) that the field X(t) changes sign N times between t1 and t2 for X(t) > ? In our case, = X0 or the mean value of the time series data X(t). Depending on the nature of the ‘random walk’—in this case a protein sequence—this probability is given by p(t1,t2) ∼ (t1/t2)-; where the exponent ‘m’ assumes different values depending on whether this probability is calculated using data above or below a certain threshold. The threshold is typically the mean value (X0) of the stochastic data. Thus multiple ‘order parameters’ may emerge from the same description depending on the chosen value of ‘m’. Our approach utilizes ‘extremal value statistics’ to analyze sequence structure. This builds on a tacit first approximation that sequences—as represented by propensity plots—can be seen as being predominantly stochastic in structure, at least when viewed synoptically. At this stage, the focus is on a specific manifestation, that of the ‘first passage probability’ distribution around the mean value X0 of the data points X(t) where ‘t’ is our metaphor for the location of the data point in the time series-like sequence. What this ‘first passage statistics’ captures is the distribution of the ‘return times’ of the fluctuating ‘time series’ across the given threshold X(t) = X0. Figure 1 shows the ‘return time lengths’ from data points below the line X(t) = X0 to one above this line and then back again to X(t) < X0. As an example, t+(1) defines the first ‘plus’-type return time where ‘plus’ refers to the regime X(t) > X0 such that the time series starts from a point below this line (X(t) < X0) and after crossing this line returns back to the regime X(t) < X0. Similarly, t+(2) refers to the second such ‘plus-type’ return time; and so on.

Fig. 1.

An illustration of first passage probability distribution across a threshold X(t) = X0

An illustration of first passage probability distribution across a threshold X(t) = X0 t-(1) refers instead to the first ‘minus-type’ return time where the time series count starts at X(t) > X0, then crosses the threshold line in to the X(t) < X0 regime and finally returns to X(t) > X0. Our interest is in the probability distributions of the t+ and t- transitions. Such a probability distribution is achieved by calculating histograms of ‘return times’ above (t+) and below (t-) the line X(t) = X0. The histograms are then normalized to obtain the desired PDFs. We assume that sequences are, to a first approximation, ‘inherently random’. As long as we are within the stochastic realm, all prior statistical results, including that for long range correlated stochastic data, suggest the existence of universal power law exponents defining the first passage probability distribution statistics related to t+ and t- transitions. In our description, we will use m+ to identify the exponent for X(t) > X0 and m_ for the exponent characterizing X(t) < X0. We now focus on this two dimensional (m,i) plane where n = +,−.

3 Methods

Here, we characterize protein sequences using an alignment-free approach based on techniques of time series analysis commonly-used in statistical physics and elsewhere (Bray, 2013; Redner, 2007).

3.1 Plotting of propensity data

In general, a window could adopt any arbitrary shape; thus we may assume a generalized window will have this form: Within the window there will be m amino acids that are to be averaged upstream of the target residue, and will have n amino acids downstream, plus the value for the residue itself; all residues within the window will have independent and arbitrary coefficients (w) that will selectively weight individual positions within the travelling window. Averaging can be undertaken several times: values averaged over any particular iteration being the values generated by averaging in the previous iteration. We extracted 544 useable propensity values from AAIndex (Kawashima ). All scales were used. Sequences were converted to numerical profiles comprising propensity values. Using a window length of 7, one round of uniform smoothing was used per sequence. Each sequence behaved as intrinsically stochastic time series-like data. For each sequence processed, we generated 544 different smoothed profiles, corresponding to the 544 scales from AAindex. Each of these 544 datasets was arranged in N column vectors, where each column vector represented a sequence. The number of entries in each of these column vectors was identical. The block structure is a matrix M (N × 544) where each element of this matrix is a column vector.

3.2 Propensity analysis

For each sequence drawn from Pfam (Finn ), and for each of the 544 AAindex propensity scales, separate m+ and m- values were calculated from the corresponding t+ and t- data. For each sequence, the arithmetic means of 544 m+ and 544 m-_values were derived: this gave the ‘order parameter’ m = 0.5*(m+ + m-). This is akin to statistical ‘ensemble averaging’, including any cogent non-ergodicity of the ensemble. The order parameter m thus represents each protein sequence as a single number. The resulting m-values were then used to cluster the sequences using the scheme detailed below. The schematic algorithm for calculating the values m+ and m- value for each of N sequences is shown below: (m+, m-) calculated for each of the 544 column vectors of row i of the M-matrix Ensemble average (arithmetical mean) taken of (m+, m-) for all 544 vectors in row i of the M-matrix Steps 1–2 repeated N times, for all 544 vectors in each row, to generate N-sets of (m+, m-) Our ‘order parameter’ is thus the mean of the previously defined m+ and m- values obtained individually as ‘scaling exponents’ (‘persistence exponents’) of the t+ (or t_) PDFs. The ultimate objective of this analysis was a clear clustering of the protein sequences such that similar proteins fall into separate clusters that are defined by separate combinations of (m,i) values (coordinate location in the m – i space) where n = +, − and i = position of the protein sequence along the x-axis. Clustering follows the logic below: Plot (m+, m-) versus relevant sequence number to generate the phase diagram Grouping of (m+, m-) versus sequence number i in the phase diagram Starting from the phase diagram defined in 2, estimate dm = mi + 1 − mi, f = (1 + dm)*di between every two points representing each sequence. Plot f versus i using a first adjustable threshold (mean of the separation distances in the phase diagram plot) that separates out the large f's form the small f's. Use dendogram based MATLAB clustering protocol Since the dendogram generates more clusters than the system actually has, use a second threshold (standard deviation of the data points in the phase diagram plot) and repeat step 2. The clustering accuracy will be shown by the level of gradient equality (see Fig. 3)

Fig. 3.

Histograms of cluster sizes obtained from the clustering data in Figure 2 and plotted on a log-log scale. The solid line represents actual data (panel 1 of Fig. 2), the dotted line represents the result from the Matlab-inbuilt algorithm (panel 2 of Fig. 2)

Clustering data are then plotted sequentially to evaluate the size of successive clusters. The PDFs of cluster lengths exhibit power-law scaling. More importantly, the scaling exponents can be used to compare the clustering accuracy of the Matlab-inbuilt architecture. Comparing this to known clusters provides the probabilistic values of the two external parameters determining the eventual cluster quality.

3.3 Application to test cases

We extracted one arbitrary but representative sequence family from each superfamily in the Pfam database (version 27.0, March 2013, 14831 families; Finn ). Seed sequences were used in preference to final sequence sets, for reasons of reliability, since members were chosen by human experts without the involvement of potentially-unreliable automated methods. Downloaded alignments were converted to un-gapped sequence sets. Figure 3 is an accuracy-check of our algorithm, estimating the cluster size distribution D. The gradients of the dotted and dashed lines in the log-log plots (cluster PDF of size) in Figure 3 are compared with that of the solid line (actual data) in identifying the ‘best fit’ (least squares) structure that leads to the most optimized clustering scheme, as shown in Figure 2.

Fig. 2.

Clustering from the part of the Pfam dataset. Panel 1 represents the reference set of known clusters with groupings obtained from Pfam. Panel 2 is the clustering obtained using an inbuilt Matlab algorithm which uses only the number of clusters from Pfam (Panel 1) as input. The y-axis represents the average value of the order parameter m, while the x-axis represents the sequence. Separate clusters are shown using different colours Histograms of cluster sizes obtained from the clustering data in Figure 2 and plotted on a log-log scale. The solid line represents actual data (panel 1 of Fig. 2), the dotted line represents the result from the Matlab-inbuilt algorithm (panel 2 of Fig. 2) The distribution follows a power-law profile (D ∼ S-p) with an (super-diffusive) exponent p ∼ 1. The two (threshold) parameters are tuned to predict the optimal fit straight line from our algorithm parallel to that of the ‘known’ data from panel 1. To the best of our knowledge, no extant procedure has such high accuracy with only two tunable parameters. The Matlab-based dendogram algorithm used requires the number of clusters as input, but beyond that the resulting clustering, including the size and location of clusters, are independent and no assumptions are made or prior knowledge used. As expected, the location and size of resulting clusters is not 100% accurate, yet it is clear from Figure 2 that mismatches are generally marginal. We also calculated two widely-used quality indices as summary measures of the Pfam clustering. The cluster separation measure of Davies and Bouldin (1979) gave a value of 0.75032, while the Dunn compactness index was 0.03296. Both values are consistent with excellent overall clustering. The data in Figures 2 and 3 establish the robustness of our algorithm. Using sequence-data with different overall characteristics and number of independent clusters (shown in panel 1 of Fig. 2: our benchmark), our results show that irrespective of the low accuracy of the cluster distribution gradient value (‘p’), as clearly seen from the non-parallel lines of Figure 3, cluster identification, results in highly accurate discrimination of groups. In other work, concentrating on more readily-apparent similarity within single Pfam clans, we have found using our current minimal and unoptimized method similar reliability at a much finer granularity. Together, these results demonstrate this sequence representation is both consistent and discriminatory at both high and low granularity.

4 Discussion

The analysis of sequence similarity is the cornerstone upon which much of bioinformatics is built. Hitherto, alignment-based approaches have completely dominated work in this area, while other approaches, of which there are several (Davies ), have been examined with much less thoroughness. Yet the need for effective approaches, able to transcend the limitations of text matching, is clear, if not widely appreciated. The protocol described here takes propensity plots produced from the 544 scales in AAindex (Kawashima ), averages them, and generates a single value characteristic of a whole sequence. Our implementation is, in essence, an out-of-the-box application of existing results, with immense potential for future refinement. We have applied here a well-understood statistical method from stochastic mathematics, using it to interpret and identify independent clusters in protein sequence data. Our approach introduces a fundamentally new way to represent sequences which is nonetheless founded on the long-standing concept of propensity scales, and, capitalizing on features of this representation, we have used it to power a novel approach to clustering. This method is able to capture much of the overall structure of a propensity plot in a single but discriminatory and self-consistent value. Many properties of a protein are encoded within its sequence in a subtle and recondite manner not amendable to direct identification through sequence alignment or the recognition of characteristic sequence motifs. Likewise, the discovery of functionally-similar but sequence-distinct proteins may be frustrated by a lack of ostensive similarity to proteins of known provenance. In such a situation, alignment-dependent approaches may produce ambiguous results or fail completely. There are many examples where structural or functional similarities are readily apparent experimentally, yet are difficult—if not impossible—to detect from textual sequence alignment. Perhaps the most obvious are the so-called twilight and midnight zones (Rost, 1999). Most protein sequences will fold into unique 3-dimensional structures, and similar sequences will typically have similar structures. Sequence alignments can routinely distinguish between sequence pairs known to have similar or non-similar structures when the sequence identity is greater than >40%. This unequivocal signal becomes lost at 20–35% sequence identity: the so-called twilight zone. Alignment-methods often fail to align paired amino-acids correctly even with 20–30% identity. Structural alignment has revealed many, many examples of so-called structural superfamilies, where proteins with less than 10% identity nonetheless retain structural propinquity (Flower, 1993; Flower ). The average identity between all sequence pairs of related structures is 8–10%, and this marks the midnight zone, which is predominantly populated by protein structures that have become similar by convergent evolution. Thus, we need to apply our approach to a variety of both solved and unsolved problems to explore its value. Solved problems include searching within and beyond the twilight zone, and to classify and identify structural and functional relationships within it effectively, it is imperative to explore alternative approaches to pairwise similarity and BLAST statistics (Karlin and Altschul, 1990). Unsolved problems include clustering whole genomes where the result is unknown, and is complicated by the presence of multi domain proteins, internal repeats, etc.; the development of regression models using our order parameter values as descriptors; and a surrogate of sequence searching using order values to define similarity measures. In protein design there is a need to move beyond making piecemeal changes to extant sequences to identify wholly new sequences with new functions and structures. Effective alignment free approaches should allow us to address such issues, particularly through the development of properly grounded regression approaches to protein sequence analysis. There are many examples of such approaches: the proteochemometric analysis of protein sequences (van Westen ), the prediction of candidate vaccines (Doytchinova and Flower, 2007a,b), and the successful assignment of bacterial proteins to various subcellular locations (Sjöström ). Regression approaches typically require three viable components: an induction engine (based on multivariate statistics, such as PLS, or a machine learning algorithm), data to be modeled (which can be quantitative or categorical in nature), and an appropriate data representation. Data modeling methods have reached sufficient maturity, and data quality is constrained by its availability on a case-by-case basis, so increasingly it is the choice of data representation that is the crucial arbiter of success. This is especially true for protein regression, which lags far behind equivalent work for small molecules. Alignment-independent similarity measures, such as our representation, offer an interesting and seemingly productive avenue for achieving progress in this endeavor. To go beyond sequence motifs and profiles, HMMs and like methodology—and thus identify common function, structure, and evolution—new, distinct, yet complimentary, methods must be devised: alignment-free methods that can work with textual alignment to identity similarity manifest as shared structure and function. In this article, we have used results from statistical physics to address alignment-free comparison, adapting results from ‘first passage probability distribution’ to derive a single summary value able to differentiate sequence groups with high accuracy at several levels of granularity. This approach is potential highly robust being largely independent of fluctuation in the tunable parameters. We anticipate that this approach will ultimately take its place alongside textual alignment as a strongly complimentary method for sequence analysis, with many advantages compared to conventional techniques.

29 in total

1. Twilight zone of protein sequence alignments.

Authors: B Rost
Journal: Protein Eng Date: 1999-02

2. Global Persistence Exponent for Nonequilibrium Critical Dynamics.

Authors:
Journal: Phys Rev Lett Date: 1996-10-28 Impact factor: 9.161

Review 3. Alignment-free sequence comparison-a review.

Authors: Susana Vinga; Jonas Almeida
Journal: Bioinformatics Date: 2003-03-01 Impact factor: 6.937

4. Statistical comparison of established T-cell epitope predictors against a large database of human and murine antigens.

Authors: A J Deavin; T R Auton; P J Greaney
Journal: Mol Immunol Date: 1996-02 Impact factor: 4.407

5. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.

Authors: S Karlin; S F Altschul
Journal: Proc Natl Acad Sci U S A Date: 1990-03 Impact factor: 11.205

6. A measure of the similarity of sets of sequences not requiring sequence alignment.

Authors: B E Blaisdell
Journal: Proc Natl Acad Sci U S A Date: 1986-07 Impact factor: 11.205

7. Peptide quantitative structure-activity relationships, a multivariate approach.

Authors: S Hellberg; M Sjöström; B Skagerberg; S Wold
Journal: J Med Chem Date: 1987-07 Impact factor: 7.446

Review 8. Proteomic applications of automated GPCR classification.

Authors: Matthew N Davies; David E Gloriam; Andrew Secker; Alex A Freitas; Miguel Mendao; Jon Timmis; Darren R Flower
Journal: Proteomics Date: 2007-08 Impact factor: 3.984

9. VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines.

Authors: Irini A Doytchinova; Darren R Flower
Journal: BMC Bioinformatics Date: 2007-01-05 Impact factor: 3.169

7 in total

1. High-Throughput Genotyping Technologies in Plant Taxonomy.

Authors: Monica F Danilevicz; Cassandria G Tay Fernandez; Jacob I Marsh; Philipp E Bayer; David Edwards
Journal: Methods Mol Biol Date: 2021

2. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Authors: Hani Z Girgis; Benjamin T James; Brian B Luczak
Journal: NAR Genom Bioinform Date: 2021-02-01