Eduardo P Costa, Celine Vens, Hendrik Blockeel.
Abstract
We propose a novel method for protein subfamily identification, that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree from a multiple alignment of the protein sequences and then uses a post-pruning procedure to extract clusters from the tree. Unlike existing methods, it constructs the hierarchical tree top-down rather than bottom-up, and it associates particular mutations with each division into subclusters. The motivating hypothesis is that this approach may yield a better tree topology, and therefore more accurate subfamily identification, while also indicating functionally important sites and allowing easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis: the novel method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow new sequences to be classified with an accuracy approaching that of hidden Markov models.
Keywords: clustering trees; decision trees; phylogenomics; protein subfamily identification; top-down clustering
Year: 2013 PMID: 23700359 PMCID: PMC3653887 DOI: 10.4137/EBO.S11609
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Illustration of a split based on a polymorphic position.
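A split of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy alignment, and the form of the test (residue at a column belongs to a given set) are assumptions made for the example, and the criterion that selects which test to use is not modeled.

```python
from typing import Dict, List, Set, Tuple


def polymorphic_positions(alignment: Dict[str, str]) -> List[int]:
    """Return the 0-based columns in which more than one distinct symbol occurs."""
    rows = list(alignment.values())
    return [i for i in range(len(rows[0])) if len({r[i] for r in rows}) > 1]


def split_on_position(
    alignment: Dict[str, str], position: int, residues: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition sequence IDs by whether the residue at `position` is in `residues`."""
    yes: List[str] = []
    no: List[str] = []
    for seq_id, aligned in alignment.items():
        (yes if aligned[position] in residues else no).append(seq_id)
    return yes, no


# Toy alignment with made-up sequences (hypothetical data).
toy = {"s1": "MKVA", "s2": "MKLA", "s3": "MRLA", "s4": "MKVA"}
candidates = polymorphic_positions(toy)
left, right = split_on_position(toy, 2, {"V"})  # test: "is column 2 a V?"
```

Each candidate column yields one or more possible tests; a top-down method would evaluate them with its selection criterion and recurse on the two resulting groups.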
Figure 2. Pseudocode for the Clus-based approach.
Figure 3. Example of a tree output by our method.
Note: The internal nodes contain the test (or set of equivalent tests) chosen for every split. The leaf nodes correspond to the predicted subfamilies.
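Because every internal node holds a test on the alignment and every leaf names a subfamily, a new aligned sequence can be classified by walking the tree from the root. The sketch below assumes a hypothetical node structure with a single test per node (a column index and a set of accepted residues); the `Node` and `Leaf` classes and the toy tree are illustrative only, and a set of equivalent tests is not modeled.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass
class Leaf:
    subfamily: str  # predicted subfamily label


@dataclass
class Node:
    position: int                # 0-based alignment column tested at this node
    residues: Set[str]           # the test succeeds if the residue is in this set
    yes: Union["Node", Leaf]     # subtree followed when the test succeeds
    no: Union["Node", Leaf]      # subtree followed when the test fails


def classify(tree: Union[Node, Leaf], aligned_seq: str) -> str:
    """Follow the tests from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if aligned_seq[node.position] in node.residues else node.no
    return node.subfamily


# Hypothetical two-level tree over a four-column alignment.
toy_tree = Node(2, {"V"}, Leaf("A"), Node(1, {"R"}, Leaf("B"), Leaf("C")))
labels = [classify(toy_tree, s) for s in ("MKVA", "MRLA", "MKLA")]
```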
Figure 4. Two trees with the same edited tree size, a smaller TBC error for tree a, and a smaller number of subfamily changes for tree b.
Note: Branches are labeled to make the explanation easier.
Statistics for the datasets.
| Dataset | No. of subfamilies | No. of sequences | Alignment length | Avg. distance (family) | Avg. distance (subfamilies) |
|---|---|---|---|---|---|
| Enolase | 8 | 472 | 431 | 2.229 | 1.041 |
| Crotonase | 10 | 365 | 264 | 1.842 | 0.728 |
| Secretin | 15 | 153 | 263 | 1.885 | 0.485 |
| Amine 1 | 7 | 358 | 344 | 1.467 | 1.075 |
| Amine 2 | 31 | 358 | 344 | 1.467 | 0.442 |
| NHR 1 | 8 | 412 | 183 | 2.124 | 0.945 |
| NHR 2 | 27 | 412 | 183 | 2.124 | 0.547 |
| NHR 3 | 77 | 409 | 183 | 2.116 | 0.263 |
| Thyroid 1 | 8 | 799 | 239 | 1.771 | 0.708 |
| Thyroid 2 | 24 | 799 | 239 | 1.771 | 0.375 |
| Estrogen 1 | 3 | 482 | 226 | 1.041 | 0.498 |
| Estrogen 2 | 10 | 482 | 226 | 1.041 | 0.301 |
| HNF4 1 | 5 | 448 | 229 | 1.276 | 0.619 |
| HNF4 2 | 22 | 448 | 229 | 1.276 | 0.404 |
| Nerve | 5 | 76 | 219 | 0.429 | 0.26 |
| Fushi | 4 | 117 | 227 | 0.756 | 0.369 |
| DAX | 2 | 40 | 133 | 0.867 | 0.397 |
Note: For each dataset we report the number of subfamilies, the number of sequences, the MSA length, the average pairwise distance between all sequences within the family, and the overall average distance within the subfamilies (we first calculate the average pairwise distance for each subfamily, and then we report the average value over all subfamilies). The sequence distances were calculated based on the Jones-Taylor-Thornton model.
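The two averages described in the note can be computed as follows. This is a sketch under stated assumptions: pairwise distances are taken as given in a dictionary keyed by sorted ID pairs (the Jones-Taylor-Thornton computation itself is not shown), the data are made up, and subfamilies with a single sequence are skipped because they have no pairs, which the source does not specify.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

Dist = Dict[Tuple[str, str], float]


def avg_pairwise(ids: Iterable[str], dist: Dist) -> Optional[float]:
    """Average distance over all unordered pairs of `ids`; None if there are no pairs."""
    pairs = list(combinations(sorted(ids), 2))
    return mean(dist[p] for p in pairs) if pairs else None


def family_and_subfamily_averages(
    subfamilies: Dict[str, List[str]], dist: Dist
) -> Tuple[Optional[float], float]:
    """Return (average over the whole family, mean of the per-subfamily averages)."""
    all_ids = [s for members in subfamilies.values() for s in members]
    family_avg = avg_pairwise(all_ids, dist)
    per_subfam = [avg_pairwise(m, dist) for m in subfamilies.values()]
    # Assumption: singleton subfamilies contribute nothing to the average.
    per_subfam = [d for d in per_subfam if d is not None]
    return family_avg, mean(per_subfam)


# Hypothetical family with two subfamilies of two sequences each.
subfamilies = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
dist = {
    ("a1", "a2"): 0.2, ("b1", "b2"): 0.4,
    ("a1", "b1"): 1.0, ("a1", "b2"): 1.0,
    ("a2", "b1"): 1.0, ("a2", "b2"): 1.0,
}
fam, sub = family_and_subfamily_averages(subfamilies, dist)
```

A family average well above the subfamily average, as in most rows of the table, indicates that sequences within a subfamily are much closer to each other than to the rest of the family.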
Number of leaves in the classification trees (CTs).
| Datasets | Nb leaves | Datasets | Nb leaves |
|---|---|---|---|
| Enolase | 8 | Thyroid 1 | 13 |
| Crotonase | 11 | Thyroid 2 | 38 |
| Secretin | 15 | Estrogen 1 | 4 |
| Amine 1 | 14 | Estrogen 2 | 15 |
| Amine 2 | 34 | HNF4 1 | 7 |
| NHR 1 | 11 | HNF4 2 | 36 |
| NHR 2 | 30 | Nerve | 5 |
| NHR 3 | 79 | Fushi | 4 |
| DAX | 2 | | |
Note: The CTs were built using supervised learning. All the leaf nodes in the CTs are pure.
Edited tree size: choosing the test selection criterion.
| Datasets | Clus-MinLength | Clus-MaxAvgDist | Clus-MaxMinDist |
|---|---|---|---|
| Enolase | 51 | 25 | |
| Crotonase | 33 | 111 | |
| Secretin | 32 | 21 | |
| Amine 1 | 54 | 33 | |
| Amine 2 | 49 | 75 | |
| NHR 1 | 49 | 36 | |
| NHR 2 | 43 | 68 | |
| NHR 3 | 139 | 107 |
Note: We indicate the best results in boldface.
TBC error: choosing the test selection criterion.
| Datasets | Clus-MinLength | Clus-MaxAvgDist | Clus-MaxMinDist |
|---|---|---|---|
| Enolase | 90 | 64 | |
| Crotonase | 41 | 99 | |
| Secretin | 30 | 11 | |
| Amine 1 | 178 | 217 | |
| Amine 2 | 80 | 56 | |
| NHR 1 | 105 | 257 | |
| NHR 2 | 80 | 38 | |
| NHR 3 | 108 | 70 |
Note: We indicate the best results in boldface.
Number of subfamily changes: choosing the test selection criterion.
| Datasets | Clus-MinLength | Clus-MaxAvgDist | Clus-MaxMinDist |
|---|---|---|---|
| Enolase | 14 | ||
| Crotonase | 11 | 37 | |
| Secretin | 15 | 15 | |
| Amine 1 | 26 | 19 | |
| Amine 2 | 49 | 38 | |
| NHR 1 | 24 | 16 | |
| NHR 2 | 48 | 32 | |
| NHR 3 | 97 | 83 |
Note: We indicate the best results in boldface.
Edited tree size: evaluating the Clus-MinLength topologies.
| Datasets | Clus-MinLength | SCI-PHY | NJ |
|---|---|---|---|
| Enolase | 28 | 37.7 | |
| Crotonase | 33 | 70 | |
| Secretin | 19 | 22 | |
| Amine 1 | 30 | 36 | |
| Amine 2 | 52 | 54.7 | |
| NHR 1 | 22 | 30 | |
| NHR 2 | 43 | 38 | |
| NHR 3 | 105 | 104.7 | |
| Thyroid 1 | 28 | 34.7 | |
| Thyroid 2 | 86 | 103.7 | |
| Estrogen 1 | 13 | 19.7 | |
| Estrogen 2 | 44 | 52.7 | |
| HNF4 1 | 21 | 29.7 | |
| HNF4 2 | 111 | 136.7 | |
| Nerve | 23.7 | ||
| Fushi | 11 | 16.7 | |
| DAX | 8.7 |
Note: We indicate the best results in boldface.
TBC error: evaluating the Clus-MinLength topologies.
| Datasets | Clus-MinLength | SCI-PHY | NJ |
|---|---|---|---|
| Enolase | 64 | 189 | |
| Crotonase | 50 | 137.3 | |
| Secretin | 13 | 12.3 | |
| Amine 1 | 178 | 242 | |
| Amine 2 | 96 | 59 | |
| NHR 1 | 269 | 133 | |
| NHR 2 | 34 | 85 | |
| NHR 3 | 58 | 79 | |
| Thyroid 1 | 117 | 55 | |
| Thyroid 2 | 130 | 116 | |
| Estrogen 1 | 12 | 234 | |
| Estrogen 2 | 44 | 161 | |
| HNF4 1 | 67 | 152 | |
| HNF4 2 | 202 | 161 | |
| Nerve | 28 | ||
| Fushi | 29 | 53 | |
| DAX | 4 | 7 |
Note: We indicate the best results in boldface.
Number of subfamily changes: evaluating the Clus-MinLength topologies.
| Datasets | Clus-MinLength | SCI-PHY | NJ |
|---|---|---|---|
| Enolase | 9 | 11 | |
| Crotonase | 11 | 16 | |
| Secretin | 15 | 15 | |
| Amine 1 | 16 | 22 | |
| Amine 2 | 37 | 42 | |
| NHR 1 | 10 | 12 | |
| NHR 2 | 29 | 29 | |
| NHR 3 | 78 | 82 | |
| Thyroid 1 | 16 | 17 | |
| Thyroid 2 | 58 | 42 | |
| Estrogen 1 | 6 | 5 | |
| Estrogen 2 | 27 | 20 | |
| HNF4 1 | 11 | ||
| HNF4 2 | 67 | 51 | |
| Nerve | |||
| Fushi | 6 | 6 | |
| DAX |
Note: We indicate the best results in boldface.
Clustering predictions for the Expert datasets.
| Datasets | Eval. measure | Clus-MinLength-ECC | SCI-PHY |
|---|---|---|---|
| Enolase | Purity | ||
| PctExPureC | 0.89 | ||
| VI distance | 1.676 | ||
| Edit distance | 70 | ||
| Nb clusters | 48 | 78 | |
| Crotonase | Purity | 0.94 (15/16) | |
| PctExPureC | 0.521 | ||
| VI distance | 1.58 | ||
| Edit distance | 32 | ||
| Nb clusters | 37 | 38 | |
| Secretin | Purity | 0.88 (14/16) | |
| PctExPureC | 0.673 | ||
| VI distance | 0.565 | ||
| Edit distance | 15 | ||
| Nb clusters | 21 | 22 | |
| Amine 1 | Purity | 0.96 (45/47) | |
| PctExPureC | 0.95 | ||
| VI distance | 1.87 | ||
| Edit distance | 46 | ||
| Nb clusters | 49 | 43 | |
| Amine 2 | Purity | 0.86 (32/37) | |
| PctExPureC | 0.701 | ||
| VI distance | 0.898 | ||
| Edit distance | 38 | ||
| Nb clusters | 49 | 43 | |
| NHR 1 | Purity | ||
| PctExPureC | 0.959 | ||
| VI distance | 1.984 | ||
| Edit distance | 40 | ||
| Nb clusters | 48 | 46 | |
| NHR 2 | Purity | 0.95 (38/40) | |
| PctExPureC | 0.932 | ||
| VI distance | 0.708 | ||
| Edit distance | 25 | ||
| Nb clusters | 48 | 46 | |
| NHR 3 | Purity | 0.38 (11/29) | |
| PctExPureC | 0.152 | ||
| VI distance | 0.949 | ||
| Edit distance | 54 | ||
| Nb clusters | 45 | 43 |
Note: PctExPureC and VI distance stand for “percentage of examples in pure clusters” and “variation of information distance”, respectively. We indicate the best results in boldface.
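Two of these measures can be sketched directly from their names. The code below is illustrative rather than the authors' evaluation script: it assumes a cluster is pure when all of its members share one true subfamily, that singleton clusters count as pure, and that the variation of information is computed with natural logarithms as the sum of the two conditional entropies. The Purity column is not reproduced because the source does not state which clusters enter its denominator.

```python
from collections import Counter
from math import log
from typing import Dict, Hashable, List


def pct_examples_in_pure_clusters(
    pred: Dict[str, Hashable], true: Dict[str, Hashable]
) -> float:
    """Fraction of examples that fall in clusters containing a single true subfamily."""
    members: Dict[Hashable, List[str]] = {}
    for example, cluster in pred.items():
        members.setdefault(cluster, []).append(example)
    in_pure = sum(
        len(m) for m in members.values() if len({true[e] for e in m}) == 1
    )
    return in_pure / len(pred)


def variation_of_information(
    pred: Dict[str, Hashable], true: Dict[str, Hashable]
) -> float:
    """VI = H(pred | true) + H(true | pred), using natural logarithms."""
    n = len(pred)
    joint = Counter((pred[e], true[e]) for e in pred)
    n_pred = Counter(pred.values())
    n_true = Counter(true[e] for e in pred)
    vi = 0.0
    for (c, k), n_ck in joint.items():
        p = n_ck / n
        vi -= p * (log(p / (n_pred[c] / n)) + log(p / (n_true[k] / n)))
    return vi


# Hypothetical clustering of four sequences against two true subfamilies.
pred = {"e1": 0, "e2": 0, "e3": 1, "e4": 1}
true = {"e1": "A", "e2": "A", "e3": "A", "e4": "B"}
pct = pct_examples_in_pure_clusters(pred, true)
vi = variation_of_information(pred, true)
```

Higher values are better for the fraction of examples in pure clusters, whereas the variation of information is a distance, so lower values indicate a clustering closer to the reference.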
Clustering predictions for the NucleaRDB datasets.
| Datasets | Eval. measure | Clus-MinLength-ECC | SCI-PHY |
|---|---|---|---|
| Thyroid 1 | Purity | 0.80 (28/35) | |
| PctExPureC | 0.937 | ||
| VI distance | 1.443 | ||
| Edit distance | 44 | ||
| Nb clusters | 36 | 52 | |
| Thyroid 2 | Purity | 0.63 (22/35) | |
| PctExPureC | 0.645 | ||
| VI distance | 0.691 | ||
| Edit distance | |||
| Nb clusters | 36 | 52 | |
| Estrogen 1 | Purity | ||
| PctExPureC | 0.967 | ||
| VI distance | 1.624 | ||
| Edit distance | 28 | ||
| Nb clusters | 22 | 31 | |
| Estrogen 2 | Purity | 0.73 (11/15) | |
| PctExPureC | 0.521 | ||
| VI distance | 0.835 | ||
| Edit distance | 33 | ||
| Nb clusters | 22 | 41 | |
| HNF4 1 | Purity | 0.93 (27/29) | |
| PctExPureC | 0.951 | ||
| VI distance | 1.471 | ||
| Edit distance | 36 | ||
| Nb clusters | 33 | 41 | |
| HNF4 2 | Purity | 0.47 (9/19) | |
| PctExPureC | 0.156 | ||
| VI distance | 1.249 | ||
| Edit distance | 55 | ||
| Nb clusters | 33 | 41 | |
| Nerve | Purity | 0.25 (1/4) | |
| PctExPureC | 0.224 | ||
| VI distance | 0.6 | ||
| Edit distance | 7 | ||
| Nb clusters | 4 | 8 | |
| Fushi | Purity | 0.667 (4/6) | |
| PctExPureC | 0.367 | ||
| VI distance | 0.774 | ||
| Edit distance | |||
| Nb clusters | 8 | 6 | |
| DAX | Purity | ||
| PctExPureC | |||
| VI distance | 0.633 | ||
| Edit distance | |||
| Nb clusters | 4 | 4 |
Note: PctExPureC and VI distance stand for “percentage of examples in pure clusters” and “variation of information distance”, respectively. We indicate the best results in boldface.
Category utility results.
| Datasets | Clus-MinLength-ECC | SCI-PHY |
|---|---|---|
| Enolase | 2.229 | |
| Crotonase | 1.797 | |
| Secretin | 5.177 | |
| Amine 1/2 | 3.138 | |
| NHR 1/2 | 1.919 | |
| NHR 3 | 2.04 | |
| Thyroid 1/2 | 2.15 | |
| Estrogen 1/2 | 2.506 | |
| HNF4 1/2 | 2.131 | |
| Nerve | 4.363 | |
| Fushi | 7.687 | |
| DAX | 8.345 |
Note: As NHR 3 has fewer sequences than NHR 1 and NHR 2, its clustering differs from the ones obtained for the latter. For this reason, NHR 3 is displayed in a separate row. We indicate the best results in boldface.
Accuracy for the Clus-MinLength-ECC tree, profile HMMs built on the Clus-MinLength-ECC clustering, and profile HMMs built on the SCI-PHY clustering.
| Datasets | Clus-MinLength-ECC (tree) | SCI-PHY + HMM | Clus-MinLength-ECC + HMM |
|---|---|---|---|
| Enolase | 0.987 | 0.985 | |
| Crotonase | 0.975 | 0.989 | |
| Secretin | 0.882 | 0.948 | |
| Amine 1 | 0.969 | 0.989 | |
| Amine 2 | 0.846 | 0.908 | |
| NHR 1 | 0.973 | 0.998 | |
| NHR 2 | 0.932 | 0.976 | |
| NHR 3 | 0.628 | 0.66 | |
| Thyroid 1 | 0.965 | 0.991 | |
| Thyroid 2 | 0.812 | 0.86 | |
| Estrogen 1 | 0.994 | 0.996 | |
| Estrogen 2 | 0.89 | 0.936 | |
| HNF4 1 | 0.98 | 0.998 | |
| HNF4 2 | 0.511 | 0.652 | |
| Nerve | 0.766 | 0.791 | |
| Fushi | 0.846 | 0.983 | |
| DAX | 0.975 | 1.000 |
Note: We indicate in boldface the best results for the comparison between Clus-MinLength-ECC (tree) and SCI-PHY + HMM.
Enolase subfamily definitions.
| Subfamily | Definition |
|---|---|
| Subfamily 1 | Chloromuconate cycloisomerase |
| Subfamily 2 | Dipeptide epimerase |
| Subfamily 3 | Enolase |
| Subfamily 4 | Galactonate dehydratase |
| Subfamily 5 | Glucarate dehydratase |
| Subfamily 6 | Methylaspartate ammonia-lyase |
| Subfamily 7 | Muconate cycloisomerase |
| Subfamily 8 | O-succinylbenzoate synthase |
Figure 5. Identified polymorphic positions in the first four levels of the Enolase tree.
Notes: Each line represents an internal node, and shows which positions in the alignment are listed in the node. The line numbers refer to the numbering of the tree nodes in Figure S4 in the supplemental material.