Swati Sinha, Andrew Michael Lynn.
Abstract
BACKGROUND: HMM-ModE is a computational method that generates family-specific profile HMMs using negative training sequences. The method optimizes the discrimination threshold using 10-fold cross-validation and modifies the emission probabilities of profiles to reduce common fold-based signals shared with other sub-families. The protocol depends on the program HMMER for HMM profile building and sequence database searching. The recent release of HMMER3 has improved database search speed by several orders of magnitude, allowing large-scale deployment of the method in sequence annotation projects. We have rewritten our existing scripts, both for parsing the HMM profiles and for modifying emission probabilities, to upgrade HMM-ModE to HMMER3 and take advantage of its probabilistic inference and high computational speed. The method is benchmarked and tested on a GPCR dataset as an accurate and fast method for functional annotation.
Year: 2014 PMID: 25073805 PMCID: PMC4236727 DOI: 10.1186/1756-0500-7-483
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Benchmarking the HMM-ModE protocol with HMMER v2 and HMMER v3. The protocol was benchmarked on a dataset of 416 enzyme families with both HMMER v2, which permits glocal alignments, and HMMER v3, which only has local-local alignments. Specificity values reported during profile building are improved in all cases, with minimal loss in sensitivity. The specificity improvement is lower for HMMER v3 than for the earlier version.
Performance evaluation of the new version of our method, HMM-ModE, on the ‘gold standard’ dataset and comparison with HMM-ModE/HMMER2 [9]
| Superfamily | Family | # Seq | Se (HMM-d) | Sp (HMM-d) | Se (HMM-ModE) | Sp (HMM-ModE) |
|---|---|---|---|---|---|---|
| AminoHydrolase | AMP deaminase | 28 | 1 | 0.90 | 1 (1) | 1 (1) |
| | Urease | 100 | 1 | 0.50 | 1 (1) | 1 (1) |
| | D-hydantoinase | 10 | 1 | 0.05 | 1 (1) | 1 (1) |
| | Dihydroorotase 2 | 13 | 1 | 0.14 | 1 (1) | 1 (1) |
| | Guanine deaminase | 11 | 1 | 0.22 | 1 (1) | 1 (1) |
| | Adenosine deaminase | 10 | 1 | 0.35 | 1 (1) | 1 (1) |
| Enolase | Enolase | 215 | 1 | 0.88 | 1 (1) | 1 (1) |
| | Glucarate dehydratase | 26 | 1 | 0.20 | 1 (1) | 1 (1) |
| | Muconate cycloisomerase | 14 | 1 | 0.09 | 1 (1) | 1 (1) |
| | Chloromuconate cycloisomerase | 10 | 1 | 0.05 | 1 (1) | 1 (1) |
| Crotonase | Enoyl-CoA hydratase | 54 | 1 | 0.23 | 1 (1) | 1 (1) |
| | Histone acetyltransferase | 11 | 11 | 0.09 | 1 (1) | 1 (1) |
| Haloacid Dehydrogenase | P-type ATPase | 91 | 1 | 0.77 | 1 (1) | 1 (1) |
| Vicinal Oxygen chelate | Catechol 2,3-dioxygenase | 32 | 1 | 0.28 | 1 (1) | 0.88 (0.88) |
| | 4-Hydroxyphenylpyruvate dioxygenase | 26 | 1 | 0.60 | 1 (1) | 1 (1) |
| | 2,3-Dihydroxybiphenyl dioxygenase | 23 | 1 | 0.20 | 0.95 (1) | 0.71 (0.63) |
| | Glyoxalase 1 | 12 | 1 | 0.20 | 1 (1) | 1 (1) |
HMM-d stands for default HMMER and # Seq for the number of sequences in each family. Se is the sensitivity, calculated as TP/(TP + FN), and Sp is the specificity, calculated as TP/(TP + FP). HMM-ModE gives a significant improvement in the specificity of the predictions, with a small compromise in the respective sensitivity.
The corresponding values of Se and Sp using HMM-ModE with HMMER2 are given in parentheses.
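The two measures defined in the footnote can be written out as a minimal Python sketch. The function names and the counts in the example are hypothetical and are not taken from the paper. Note that Sp follows the footnote's definition, TP/(TP + FP), rather than the TN-based definition used elsewhere in the literature.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Se = TP / (TP + FN); returns 0.0 when there are no positives."""
    denom = tp + fn
    return tp / denom if denom else 0.0


def specificity(tp: int, fp: int) -> float:
    """Sp = TP / (TP + FP), as defined in the table footnote."""
    denom = tp + fp
    return tp / denom if denom else 0.0


# Hypothetical counts for one family profile (illustration only).
tp, fn, fp = 19, 1, 5
print(round(sensitivity(tp, fn), 2))  # 0.95
print(round(specificity(tp, fp), 2))  # 0.79
```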
Comparison of our method with other methods and with HMM-ModE/HMMER2 [9] in classifying the D167 dataset
| | | | | | | | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Acetylcholine | 31 | 29 | 100 | 67.74 | 90.3 | 93.6 | 93.3 | 96.7 | |||
| Adrenoceptor | 44 | 44 | 100 | 88.64 | 86.4 | 100 | 100 | 100 | |||
| Dopamine | 38 | 37 | 94.74 | 81.58 | 78.9 | 92.1 | 94.7 | 92.1 | |||
| Serotonin | 54 | 54 | 98.15 | 88.89 | 79.6 | 98.2 | 100 | 100 |
Accuracy is calculated as (TP + TN)/(TP + FN + TN + FP) here and throughout.
“Total” is the number of sequences observed in a sub-family; “Predicted” is the number of sequences correctly predicted by our method. The accuracy values for the other methods are taken directly from their articles. The data in bold show the results using HMM-ModE with HMMER3 and HMMER2.
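The accuracy formula above can be sketched in Python as follows. The function name and the confusion-matrix counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + FN + TN + FP)."""
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts for one sub-family (illustration only).
acc = accuracy(tp=29, tn=135, fp=1, fn=2)
print(round(100 * acc, 2))  # 98.2
```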
Comparison of our method with PCA-GPCR and with HMM-ModE/HMMER2 [9] in classifying the D566 dataset
| | | | | | | |
|---|---|---|---|---|---|---|
| D566 | Adrenoceptor | 66 | 65 | 98.48 | ||
| Chemokine | 92 | 84 | 97.83 | |||
| Dopamine | 43 | 41 | 93.02 | |||
| Neuropeptide | 31 | 31 | 96.77 | |||
| Olfactory | 84 | 84 | 100 | |||
| Rhodopsin | 183 | 183 | 98.36 | |||
| Serotonin | 67 | 67 | 97.01 | |||
“Total” is the number of sequences observed in a sub-family; “Predicted” is the number of sequences correctly predicted by our method. The accuracy values for the other method are taken directly from its article.
The data in bold show the results using HMM-ModE with HMMER3 and HMMER2.
Comparison with PCA-GPCR and HMM-ModE/HMMER2 [9] in classifying the D1238 and D365 datasets
| | | | | | | |
|---|---|---|---|---|---|---|
| D1238 | Rhodopsin like | 1103 | 1030 | 99.91 | ||
| Secretin like | 84 | 66 | 98.81 | |||
| Metabotropic/glutamate/pheromone | 51 | 47 | 98.04 | |||
| D365 | Rhodopsin like | 232 | 200 | 95.69 | ||
| Secretin like | 39 | 23 | 87.18 | |||
| Metabotropic/glutamate/pheromone | 44 | 34 | 88.64 | |||
| Fungal Pheromone | 23 | 23 | 95.65 | |||
| CAMP receptors | 10 | 10 | 100 | |||
| Frizzled/smoothened | 17 | 17 | 64.71 | |||
The accuracy values for PCA-GPCR are taken directly from the corresponding article. The data in bold show the results using HMM-ModE with HMMER3 and HMMER2.
Performance of our method, HMM-ModE/HMMER2 [9] and PCA-GPCR on the GPCR_human dataset
| | |||||
|---|---|---|---|---|---|
| Muscarinic acetylcholine | 11 | 11 | 100 | ||
| Adrenoceptors | 24 | 23 | 95.83 | ||
| Dopamine | 17 | 16 | 94.11 | ||
| Histamine | 16 | 16 | 75.00 | ||
| Serotonin | 26 | 25 | 76.92 | ||
| Trace amine | 23 | 14 | 78.26 | ||
“Total” is the number of sequences present in a sub-family. The data in bold show the results using HMM-ModE with HMMER3 and HMMER2.
Figure 2. Comparison of the number of false positives using HMMER and HMM-ModE. The line diagram compares the number of false positives picked up by each GPCR subfamily profile using HMMER profiles with the default threshold and HMM-ModE profiles with a discriminating threshold (similar to the GA1 threshold used by Pfam) generated through 10-fold cross-validation, as discussed in our published work [9]. The number of false positives is markedly lower with our method, which helps annotate protein sequences with high specificity.
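To illustrate what a discriminating threshold means in practice, the sketch below picks a score cutoff from held-out scores and averages the cutoffs over folds. This is only an illustration under stated assumptions: the selection criterion (maximizing Se × Sp), the function names and the scores are all hypothetical, and the criterion actually used by HMM-ModE is the one described in [9].

```python
from statistics import mean
from typing import List, Sequence, Tuple


def pick_threshold(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Return the cutoff that maximizes Se * Sp (an assumed criterion).

    pos_scores: scores of held-out sequences from the family itself.
    neg_scores: scores of held-out sequences from other families.
    A sequence is called a hit when its score is >= the cutoff.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    if not candidates:
        raise ValueError("no scores supplied")
    best_cutoff, best_value = candidates[0], -1.0
    for cutoff in candidates:
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        se = tp / (tp + fn) if tp + fn else 0.0
        sp = tp / (tp + fp) if tp + fp else 0.0
        if se * sp > best_value:  # ties keep the lower cutoff
            best_cutoff, best_value = cutoff, se * sp
    return best_cutoff


def cross_validated_threshold(folds: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average the per-fold cutoffs (one (pos, neg) score pair per fold)."""
    return mean(pick_threshold(pos, neg) for pos, neg in folds)


# Hypothetical bit scores for a single fold (illustration only).
print(pick_threshold([120.0, 95.5, 88.0, 60.2], [70.1, 40.0, 35.5]))  # 60.2
```

With ten folds, `cross_validated_threshold` would receive ten such score pairs, one per held-out partition.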
Figure 3. Schematic flow of how HMM-ModE works on a set of pre-classified protein family sequences. A set of pre-classified (functionally classified) sequences is clustered using MCL to obtain clusters of similar sequences. These clusters are aligned separately, and HMM profiles are built using ‘hmmbuild’ from the HMMER package; these are known as true positive (TP) profiles. The TP profiles are scanned against all the sequences. Ideally a profile should pick up only sequences belonging to its own family, but it invariably also picks up sequences belonging to other families because of fold-specific signals shared across families. We call these false positive (FP) sequences and generate FP HMM profiles from them. If the number of FPs is greater than 200, we randomly sample a representative set of 200 sequences to generate the FP profile. The TP and FP profiles are then aligned using profile-profile alignment from MUSCLE, and this alignment is used to identify the discriminating residues and modify the corresponding emission probabilities of the TP profile. A 10-fold cross-validation is also performed to identify a discriminating threshold. The resulting profiles, with modified emission probabilities and the defined discriminating threshold, are known as HMM-ModE profiles.
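The false-positive subsampling step in the caption can be sketched in Python as below. The cap of 200 comes from the caption; the function name, the `seed` parameter and the sequence identifiers are hypothetical, and the sketch uses plain uniform sampling, which may differ from how the authors select a representative set.

```python
import random
from typing import Iterable, List, Optional

MAX_FP = 200  # cap on FP sequences stated in the figure caption


def sample_false_positives(
    fp_ids: Iterable[str], max_fp: int = MAX_FP, seed: Optional[int] = None
) -> List[str]:
    """Return all FP sequence IDs, or a uniform random sample of max_fp of them."""
    ids = list(fp_ids)
    if len(ids) <= max_fp:
        return ids
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    return rng.sample(ids, max_fp)


# Hypothetical identifiers (illustration only).
hits = [f"seq{i}" for i in range(500)]
subset = sample_false_positives(hits, seed=0)
print(len(subset))  # 200
```

The sampled identifiers would then be used to extract the sequences from which the FP profile is built.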