Siddhartha Kundu
Abstract
The accurate annotation of an unknown protein sequence depends on extant data of template sequences. These may be empirical measurements or sets of reference sequences, and provide an exhaustive pool of probable functions. Individual methods of predicting dominant function possess shortcomings such as varying degrees of inter-sequence redundancy, arbitrary domain-inclusion thresholds, heterogeneous parameterization protocols, and ill-conditioned input channels. Here, I present a rigorous theoretical derivation of the steps of a generic algorithm that integrates several statistical methods to predict the dominant function of an unknown protein sequence. The accompanying mathematical proofs, interval definitions, analysis, and numerical computations offer insights not only into the specificity and accuracy of the predictions, but also into the operational mechanisms of the integration and its ensuing rigor. The algorithm numerically modifies raw hidden Markov model scores of well-defined sets of training sequences and clusters them on the basis of known function. The results are then fed into an artificial neural network, whose predictions can be refined using the available data. This pipeline is trained recursively and can be used to discern the dominant principal function and thereby annotate an unknown protein sequence. While the approach is complex, the specificity of the final predictions can help laboratory workers design their experiments with greater confidence.
Keywords: Algorithm; Artificial neural network; Dominant protein function; Hidden Markov model; Subfamily
Year: 2018 PMID: 29700659 PMCID: PMC7250805 DOI: 10.1007/s10441-018-9327-x
Source DB: PubMed Journal: Acta Biotheor ISSN: 0001-5342 Impact factor: 1.774
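The pipeline the abstract describes (raw HMM scores → numerical modification → pairwise feature construction → ANN prediction of the dominant function) can be sketched minimally as follows. The min-max scaling, the product-based pair features, the toy scores, and the fixed weights are illustrative assumptions for the sketch, not the paper's actual parameterization.

```python
import math
from itertools import combinations

def modify_scores(raw):
    """Illustrative numerical modification: min-max scale raw HMM
    scores to [0, 1]. The paper's actual transformation may differ."""
    lo, hi = min(raw), max(raw)
    return [(s - lo) / (hi - lo) for s in raw]

def pair_features(scores):
    """One feature per unordered pair of scores (the set B);
    the product is an arbitrary illustrative pairing function."""
    return [a * b for a, b in combinations(scores, 2)]

def ann_forward(x, w_hidden, w_out):
    """Minimal single-hidden-layer forward pass, logistic activations."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    h = [sig(sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
    return [sig(sum(wi * hi for wi, hi in zip(row, h))) for row in w_out]

# Toy example: 4 raw HMM scores for one sequence (hypothetical values).
raw = [12.0, 3.5, 7.2, 9.9]
x = pair_features(modify_scores(raw))  # |B| = C(4, 2) = 6 inputs
w_hidden = [[0.1] * len(x)] * 3        # 3 hidden nodes, fixed toy weights
w_out = [[0.2] * 3] * 4                # one output per candidate function
probs = ann_forward(x, w_hidden, w_out)
best = probs.index(max(probs))         # dominant function = argmax output
```

In the published method the network is trained recursively on function-wise clusters of modified scores; here the weights are fixed constants purely so the forward pass runs end to end.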
Role of inputs in defining ANN architecture
| \|A\| | \|B\| | \|D\| | H1 | H2 | H3 |
|---|---|---|---|---|---|
| 3 | 3 | 3 | 4 | 4 | 3 |
| 4 | 6 | 15 | 12 | 8 | 11 |
| 5 | 10 | 45 | 30 | 14 | 31 |
| 6 | 15 | 105 | 63 | 21 | 71 |
| 7 | 21 | 210 | 120 | 29 | 141 |
| 8 | 28 | 378 | 209 | 39 | 253 |
| 9 | 36 | 630 | 341 | 50 | 421 |
| 10 | 45 | 990 | 527 | 63 | 661 |
| 11 | 55 | 1485 | 782 | 77 | 991 |
| 12 | 66 | 2145 | 1119 | 93 | 1431 |
| 13 | 78 | 3003 | 1557 | 110 | 2003 |
| 14 | 91 | 4095 | 2112 | 128 | 2731 |
| 15 | 105 | 5460 | 2804 | 148 | 3641 |
| 16 | 120 | 7140 | 3655 | 169 | 4761 |
| 17 | 136 | 9180 | 4686 | 192 | 6121 |
| 18 | 153 | 11628 | 5922 | 216 | 7753 |
A: Set of raw HMM scores of a protein sequence
B: Set of pairs of raw HMM scores of a protein sequence
D: Set of pairs-of-pairs of raw HMM scores of a protein sequence
H1, H2, H3: Methods to compute the number of nodes in the hidden layer of a 1:1:1 ANN
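The cardinalities in the table follow directly from counting unordered pairs: |B| = C(|A|, 2) and |D| = C(|B|, 2). A quick standard-library check reproduces the first three columns:

```python
from math import comb

def cardinalities(n):
    """For n = |A| raw HMM scores, return (|A|, |B|, |D|):
    |B| = C(n, 2) pairs, |D| = C(|B|, 2) pairs-of-pairs."""
    b = comb(n, 2)
    d = comb(b, 2)
    return n, b, d

# Reproduce the first rows of the table above:
# (3, 3, 3), (4, 6, 15), (5, 10, 45), (6, 15, 105), (7, 21, 210)
for n in range(3, 8):
    print(cardinalities(n))
```

The quadratic growth of |B| and quartic growth of |D| in |A| is what makes the cardinality of the superset of probable functions so consequential for the ANN's input-layer width.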
Fig. 1 Generic algorithm for predicting dominant function in a protein sequence. a Steps needed to construct and validate the HMM–ANN algorithm on a well-characterized training set. The datasets may be repeatedly sampled for parameter definition and model refinement. The final output is a set of high-confidence bounds that is mapped and specific to each predicted function. b Analysis of the cardinality of the various sets used in parameterization. c Scatter plot between the number of predicted functions and the pairs-of-pairs of modified HMM scores. d Relevance of the cardinality of the superset of probable functions to the architecture of the ANN. Abbreviations: HMM, hidden Markov model; ANN, artificial neural network; A, B, D, sets of raw HMM scores; H1, H2, H3, methods to compute the number of nodes in the hidden layer of a 1:1:1 ANN