Prashant K Srivastava, Dhwani K Desai, Soumyadeep Nandi, Andrew M Lynn.
Abstract
BACKGROUND: Profile Hidden Markov Models (HMMs) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments, and they have been used with considerable success to identify remote homologues. These conservation patterns arise from fold-specific signals, which are shared across multiple families, and function-specific signals, which are unique to each family. The availability of sequences pre-classified according to their function permits the use of negative training sequences to improve the specificity of the HMM, both by optimizing the threshold cutoff and by modifying emission probabilities to minimize the influence of fold-specific signals. A protocol to generate family-specific HMMs is described that first constructs a profile HMM from an alignment of the family's sequences and then uses this model to identify sequences belonging to other classes that score above the default threshold (false positives). Ten-fold cross validation is used to optimise the discrimination threshold score for the model. The advent of fast multiple alignment methods enables the use of the profile alignments to align the true and false positive sequences, and the resulting alignments are used to modify the emission probabilities in the original model.
Year: 2007 PMID: 17389042 PMCID: PMC1852395 DOI: 10.1186/1471-2105-8-104
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A multiple alignment showing the common fold-specific signals along with the group-specific, sub-family function-specific signals. Alscript [36] figure showing a portion of the alignment of representatives of six protein kinase families discussed in the text. The alignment is coloured based on residue conservation: red and pink (identical and conserved across all families) correspond to fold signals, and blue and green mark residues identical and conserved within a family. Positions predicted to confer specificity for the family [35] are highlighted in yellow. Deleted regions are indicated by dashes (- - -). Numbers below the alignment correspond to the PDB structure 2f7z.
Figure 2. A Receiver-Operator Characteristic (ROC) curve of HMM-d and HMM-ModE for the PVPK sub-family. HMM-ModE: blue; HMM-d: red.
Figure 3. Determination of the optimal discrimination threshold. The average Matthews correlation coefficient (MCC, bold black) distribution is overlaid on the sensitivity and specificity plots for each of the 10-fold cross validation samples of the PVPK sub-family. Figures are plotted for the default profile HMM-d (top, A), HMM-ModE (center, B) and HMM-Sub (bottom, C).
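The threshold selection summarised in Figure 3 can be illustrated with a minimal sketch. It assumes each cross-validation sample is available as two lists of profile scores (true family members and non-members) and that the optimal threshold is the candidate with the highest MCC averaged over samples. The function names and the input layout are hypothetical and are not taken from the authors' implementation.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def confusion(pos_scores, neg_scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called members."""
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fp, tn, fn

def best_threshold(folds, candidates):
    """Pick the candidate threshold with the highest MCC averaged over folds.

    folds: list of (pos_scores, neg_scores) pairs, one per held-out sample
           (hypothetical layout chosen for this sketch).
    """
    def avg_mcc(t):
        return sum(mcc(*confusion(p, n, t)) for p, n in folds) / len(folds)
    return max(candidates, key=avg_mcc)

# Toy scores for two folds; real input would be the ten held-out samples.
folds = [([50, 60, 70], [10, 20, 55]), ([45, 65], [5, 30])]
print(best_threshold(folds, [0, 25, 40, 58, 80]))
```

Averaging the MCC before taking the maximum mirrors the bold black curve in the figure; choosing a threshold per fold and averaging the thresholds would be an alternative design.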
Figure 4. Six subfamilies of the AGC family of protein kinases.
Performance of HMM-d, HMM-t, HMM-ModE and HMM-Sub for the sub-family classification of the AGC family of kinases.
| 1 (0) | 0.27 (0.03) | 1 (0) | 1 (0) | 1 (0) | 1 (0) | 1 (0) | 1 (0) |
| 0.96 (0.1) | 0.18 (0.02) | 0.89 (0.13) | 1 (0) | 0.89 (0.13) | 1 (0) | 0.96 (0.07) | 1 (0) |
| 0.99 (0.05) | 0.42 (0.1) | 0.95 (0.08) | 0.99 (0.2) | 0.96 (0.08) | 1 (0) | 0.97 (0.06) | 1 (0) |
| 1 (0) | 0.17 (0.03) | 0.94 (0.1) | 0.93 (0.13) | 0.94 (0.1) | 0.96 (0.12) | 0.96 (0.08) | 1 (0) |
| 1 (0) | 0.09 (0.00) | 0.9 (0.16) | 1 (0) | 0.93 (0.1) | 1 (0) | 0.97 (0.1) | 1 (0) |
| 1 (0) | 0.14 (0.01) | 0.98 (0.08) | 0.98 (0.06) | 0.975 (0.08) | 0.98 (0.06) | 0.93 (0.24) | 0.98 (0.06) |
The values in parentheses indicate the standard deviation for all 10 samples in a 10-fold cross validation.
HMM-d – HMM profile used with default threshold
HMM-t – HMM profile used with optimised threshold
HMM-ModE – profile with modified emission probabilities
HMM-Sub – Log-difference-of-odds-score method
Figure 5. An outline of some Level 1 and Level 2 subfamilies of the GPCR Class A proteins. The level-2 sub-families used in this study are marked in bold.
Figure 6. An outline of some Level 1 and Level 2 subfamilies of the GPCR Class C proteins. The level-2 sub-families used in this study are marked in bold.
Coverage (percentage of True Positives identified before the first False Positive) and the average percentage of errors per sequence at the MEP of HMM-d and HMM-ModE for classification of Level-2 sub-families of Class A and Class C GPCR proteins
| 1 | 1 | 1 | 0.92 | 0 | 0 | 0 | 0.08 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.48 | 0.97 | 0.45 | 0.97 | 0.06 | 0.03 | 0.06 | 0.03 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 0.89 | 1 | 0.89 | 0 | 0.11 | 0 | 0.11 | |
| 0.5 | 0.8 | 0.5 | 0.8 | 0.5 | 0.2 | 0.5 | 0.2 | |
| 1 | 0.55 | 0.98 | 0.55 | 0 | 0.09 | 0.02 | 0.11 | |
| 0.87 | 0.91 | 0.87 | 0.91 | 0.13 | 0.09 | 0.13 | 0.09 | |
| 1 | 1 | 1 | 0.88 | 0.29 | 0 | 0.29 | 0.13 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.86 | 1 | 0.86 | 1 | 0.14 | 0 | 0.14 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 0.5 | 0 | 0 | 0 | 0.5 | |
| 0.76 | 0.73 | 0.76 | 0.73 | 0.24 | 0.27 | 0.24 | 0.27 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
| 0.88 | 1 | 0.88 | 1 | 0.13 | 0 | 0.13 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.29 | 1 | 0.29 | 1 | 0.71 | 0 | 0.71 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.86 | 0.80 | 0.86 | 0.80 | 0.14 | 0.10 | 0.14 | 0.10 | |
| 1 | 1 | 0.88 | 1 | 0 | 0 | 0.13 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 0.71 | 1 | 0.71 | 0 | 0.29 | 0 | 0.29 | |
| 0.83 | 1 | 1 | 1 | 0.17 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 0.86 | 0 | 0 | 0 | 0.14 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 0.91 | 1 | 0 | 0 | 0.09 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.6 | 1 | 0.6 | 1 | 0.4 | 0 | 0.4 | 0 | |
| 0.86 | 0.67 | 0.71 | 0.50 | 0.14 | 0.33 | 0.29 | 0.50 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.94 | 0.94 | 0.94 | 0.94 | 0.06 | 0.06 | 0.06 | 0.06 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 0.92 | 1 | 0.92 | 1 | 0.08 | 0 | 0.08 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | |
| 0.95 | 0.97 | 0.94 | 0.95 | 0.06 | 0.05 | 0.07 | 0.07 | |
| 0.96 | 0.95 | 0.06 | 0.07 | |||||
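The coverage measure used in the table above (true positives identified before the first false positive) can be sketched as follows. This is an illustrative reading of the caption's definition rather than the authors' code: the function name and the `(score, is_true_member)` input layout are assumptions, and the result is returned as a fraction between 0 and 1.

```python
def coverage_before_first_fp(hits, n_true):
    """Fraction of true members ranked above the first false positive.

    hits:   list of (score, is_true_member) pairs for all scored sequences
            (hypothetical layout chosen for this sketch).
    n_true: total number of true members in the test set.
    """
    # Rank hits from the highest score to the lowest.
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    found = 0
    for _score, is_true in ranked:
        if not is_true:
            break  # stop counting at the first false positive
        found += 1
    return found / n_true if n_true else 0.0

# Toy ranking: two true members score above the first false positive.
hits = [(90.0, True), (80.0, True), (75.0, False), (70.0, True)]
print(coverage_before_first_fp(hits, n_true=3))
```

A coverage of 1 means that every true member outscored every non-member, so a single threshold could separate the two sets without error.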
Figure 7. An outline of the S/T-Y kinase/atypical kinase/lipid kinase/ATP-grasp Fold Group as categorized in [23]. The EC numbers for which training sequences were available in the ENZYME database are marked in bold.
Application of HMM-d, HMM-t, HMM-ModE and HMM-Sub for function-specific classification of the S/T-Y kinase/atypical kinase/lipid kinase/ATP-grasp fold family
| EC number | HMM-d | HMM-t | HMM-ModE | HMM-Sub |
|---|---|---|---|---|
| 2.7.1.100 | 19 | 16(0) | * | * |
| 2.7.1.116 | 43 | 43(4) | * | * |
| 2.7.1.117 | 103 | 103(8) | * | * |
| 2.7.1.123 | 3392 | 529(120) | 934(203) | 3264 |
| 2.7.1.125 | 295 | 11(4) | 11(4) | 259 |
| 2.7.1.126 | 96 | 34(2) | 34(2) | 37 |
| 2.7.1.129 | 5 | 5(0) | * | * |
| 2.7.1.137 | 135 | 135(29) | * | * |
| 2.7.1.32 | 93 | 64(23) | * | * |
| 2.7.1.38 | 2634 | 22(4) | 22(4) | 109 |
| 2.7.1.39 | 260 | 260(54) | * | * |
| 2.7.1.67 | 57 | 57(15) | * | * |
| 2.7.1.68 | 36 | 36(15) | * | * |
| 2.7.1.82 | 45 | 31(10) | * | * |
| 2.7.1.95 | 19 | 19(3) | * | * |
| 2.7.9.1 | 171 | 169(55) | * | * |
| 2.7.9.2 | 111 | 107(10) | * | * |
The numbers in parentheses for HMM-ModE and HMM-t are the total counts of sequences annotated as hypothetical, putative, unknown or unnamed that were classified by the two protocols. A "*" in the HMM-ModE and HMM-Sub columns indicates that the HMMER profile picked up too few false positive sequences from the negative training data to build a false positive profile.