Siddhartha Kundu
Abstract
The accurate annotation of an unknown protein sequence depends on extant data of template sequences. These may be empirical measurements or sets of reference sequences, and provide an exhaustive pool of probable functions. Individual methods of predicting dominant function possess shortcomings such as varying degrees of inter-sequence redundancy, arbitrary domain-inclusion thresholds, heterogeneous parameterization protocols, and ill-conditioned input channels. Here, I present a rigorous theoretical derivation of the steps of a generic algorithm that integrates several statistical methods to predict the dominant function of an unknown protein sequence. The accompanying mathematical proofs, interval definitions, analysis, and numerical computations offer insights not only into the specificity and accuracy of the predictions, but also into the operational mechanisms of the integration and its ensuing rigor. The algorithm numerically modifies raw hidden Markov model scores of well-defined sets of training sequences and clusters them on the basis of known function. The results are then fed into an artificial neural network, whose predictions can be refined using the available data. This pipeline is trained recursively and can be used to discern the dominant principal function and thereby annotate an unknown protein sequence. While the approach is complex, the specificity of the final predictions can help laboratory workers design their experiments with greater confidence.
Keywords: Algorithm; Artificial neural network; Dominant protein function; Hidden Markov model; Subfamily
Year: 2018 PMID: 29700659 PMCID: PMC7250805 DOI: 10.1007/s10441-018-9327-x
Source DB: PubMed Journal: Acta Biotheor ISSN: 0001-5342 Impact factor: 1.774
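The pipeline the abstract describes (raw HMM scores → numerical modification → pairwise feature construction → ANN prediction of the dominant function) can be sketched minimally as follows. The min-max scaling, the product-based pair features, the toy scores, and the fixed weights are illustrative assumptions for the sketch, not the paper's actual parameterization.

```python
import math
from itertools import combinations

def modify_scores(raw):
    """Illustrative numerical modification: min-max scale raw HMM
    scores to [0, 1]. The paper's actual transformation may differ."""
    lo, hi = min(raw), max(raw)
    return [(s - lo) / (hi - lo) for s in raw]

def pair_features(scores):
    """One feature per unordered pair of scores (the set B);
    the product is an arbitrary illustrative pairing function."""
    return [a * b for a, b in combinations(scores, 2)]

def ann_forward(x, w_hidden, w_out):
    """Minimal single-hidden-layer forward pass, logistic activations."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    h = [sig(sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
    return [sig(sum(wi * hi for wi, hi in zip(row, h))) for row in w_out]

# Toy example: 4 raw HMM scores for one sequence (hypothetical values).
raw = [12.0, 3.5, 7.2, 9.9]
x = pair_features(modify_scores(raw))  # |B| = C(4, 2) = 6 inputs
w_hidden = [[0.1] * len(x)] * 3        # 3 hidden nodes, fixed toy weights
w_out = [[0.2] * 3] * 4                # one output per candidate function
probs = ann_forward(x, w_hidden, w_out)
best = probs.index(max(probs))         # dominant function = argmax output
```

In the published method the network is trained recursively on function-wise clusters of modified scores; here the weights are fixed constants purely so the forward pass runs end to end.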
Role of inputs in defining ANN architecture
| \|A\| | \|B\| | \|D\| | H1 | H2 | H3 |
|---|---|---|---|---|---|
| 3 | 3 | 3 | 4 | 4 | 3 |
| 4 | 6 | 15 | 12 | 8 | 11 |
| 5 | 10 | 45 | 30 | 14 | 31 |
| 6 | 15 | 105 | 63 | 21 | 71 |
| 7 | 21 | 210 | 120 | 29 | 141 |
| 8 | 28 | 378 | 209 | 39 | 253 |
| 9 | 36 | 630 | 341 | 50 | 421 |
| 10 | 45 | 990 | 527 | 63 | 661 |
| 11 | 55 | 1485 | 782 | 77 | 991 |
| 12 | 66 | 2145 | 1119 | 93 | 1431 |
| 13 | 78 | 3003 | 1557 | 110 | 2003 |
| 14 | 91 | 4095 | 2112 | 128 | 2731 |
| 15 | 105 | 5460 | 2804 | 148 | 3641 |
| 16 | 120 | 7140 | 3655 | 169 | 4761 |
| 17 | 136 | 9180 | 4686 | 192 | 6121 |
| 18 | 153 | 11628 | 5922 | 216 | 7753 |
A: Set of raw HMM scores of a protein sequence
B: Set of pairs of raw HMM scores of a protein sequence
D: Set of pairs-of-pairs of raw HMM scores of a protein sequence
H1, H2, H3: Methods to compute the number of nodes in the hidden layer of a 1:1:1 ANN
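The cardinalities in the table follow directly from counting unordered pairs: |B| = C(|A|, 2) and |D| = C(|B|, 2). A quick standard-library check reproduces the first three columns:

```python
from math import comb

def cardinalities(n):
    """For n = |A| raw HMM scores, return (|A|, |B|, |D|):
    |B| = C(n, 2) pairs, |D| = C(|B|, 2) pairs-of-pairs."""
    b = comb(n, 2)
    d = comb(b, 2)
    return n, b, d

# Reproduce the first rows of the table above:
# (3, 3, 3), (4, 6, 15), (5, 10, 45), (6, 15, 105), (7, 21, 210)
for n in range(3, 8):
    print(cardinalities(n))
```

The quadratic growth of |B| and quartic growth of |D| in |A| is what makes the cardinality of the superset of probable functions so consequential for the ANN's input-layer width.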
Fig. 1 Generic algorithm for predicting dominant function in a protein sequence. a Steps needed to construct and validate the HMM–ANN algorithm on a well-characterized training set. The datasets may be repeatedly sampled for parameter definition and model refinement. The final output is a set of high-confidence bounds that is mapped and specific to each predicted function. b Analysis of the cardinality of the various sets used in parameterization. c Scatter plot between the number of predicted functions and the pairs-of-pairs of modified HMM scores. d Relevance of the cardinality of the superset of probable functions to the architecture of the ANN. Abbreviations: HMM, hidden Markov model; ANN, artificial neural network; A, B, D, sets of raw HMM scores; H1, H2, H3, methods to compute the number of nodes in the hidden layer of a 1:1:1 ANN