Miguel A Santos, Andrei L Turinsky, Serene Ong, Jennifer Tsai, Michael F Berger, Gwenael Badis, Shaheynoor Talukder, Andrew R Gehrke, Martha L Bulyk, Timothy R Hughes, Shoshana J Wodak.
Abstract
Classifying proteins into subgroups with similar molecular function on the basis of sequence is an important step in deriving reliable functional annotations computationally. So far, however, available classification procedures have been evaluated against protein subgroups that are defined by experts using mainly qualitative descriptions of molecular function. Recently, in vitro DNA-binding preferences to all possible 8-nt DNA sequences have been measured for 178 mouse homeodomains using protein-binding microarrays, offering the unprecedented opportunity of evaluating the classification methods against quantitative measures of molecular function. To this end, we automatically derive homeodomain subtypes from the DNA-binding data and independently group the same domains using sequence information alone. We test five sequence-based methods, which use different sequence-similarity measures and algorithms to group sequences. Results show that methods that optimize the classification robustness reflect well the detailed functional specificity revealed by the experimental data. In some of these classifications, 73-83% of the subfamilies exactly correspond to, or are completely contained in, the function-based subtypes. Our findings demonstrate that certain sequence-based classifications are capable of yielding very specific molecular function annotations. The availability of quantitative descriptions of molecular function, such as DNA-binding data, will be a key factor in exploiting this potential in the future.
Year: 2010 PMID: 20705649 PMCID: PMC3001082 DOI: 10.1093/nar/gkq714
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Objective subfamily classifications of mouse HD by CD-HIT and the three graph pruning procedures. Two curves are plotted on each panel. One is the variation of the largest cluster size as a function of the threshold for pruning the sequence similarity/dissimilarity graph (red dots), and the other is the average variation of information distance (VI) between a partition (subfamily classification) at a given threshold and those of its four flanking partitions. The dotted vertical line indicates the pruning threshold value at which the entropy of the resulting partition equals half its maximum value (ln N). The selected robust partition corresponds to the lowest local minimum in the VI curve found after that midpoint (indicated by an arrow). Shown are four plots computed using two methods, applied to different similarity/dissimilarity graphs; PID stands for percent sequence identity; TRE stands for total relative entropy. (A) and (B) were derived from the mH178 data set; (C) and (D), from the H559 data set. (A) Plots for CD-HIT; robust partitions identified at PID = 78%. (B) Plots for OSLPID; robust partition identified at PID = 81%. (C) Plots for OSLE-val; robust partition identified at BLAST log E-values = –25.7. (D) Plots for OSLTRE; robust partition identified at TRE = 0.1317. The direction of this plot is reversed relative to the three others, as TRE is a dissimilarity measure.
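The selection rule described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_robust_index` is hypothetical, the entropies and averaged VI values are assumed to be precomputed for each threshold, the thresholds are assumed to be ordered so that entropy grows with the index, and "after that midpoint" is read here as "at or after".

```python
import math


def select_robust_index(entropies, avg_vi, n_items):
    """Return the index of the selected partition, or None if none qualifies.

    entropies[i]: entropy (in nats) of the partition at the i-th threshold.
    avg_vi[i]:    average VI distance of partition i to its flanking partitions.
    Both lists are assumed ordered so that entropy grows with the index.
    """
    half_max = 0.5 * math.log(n_items)  # half of the maximum entropy, ln N
    # First index at which the entropy reaches half of its maximum.
    midpoint = next((i for i, h in enumerate(entropies) if h >= half_max), None)
    if midpoint is None:
        return None
    # Interior local minima of the VI curve located at or after the midpoint.
    minima = [
        i
        for i in range(max(midpoint, 1), len(avg_vi) - 1)
        if avg_vi[i] <= avg_vi[i - 1] and avg_vi[i] <= avg_vi[i + 1]
    ]
    # Among those, keep the one with the smallest averaged VI.
    return min(minima, key=lambda i: avg_vi[i]) if minima else None


# Toy curves for N = 10 items: the deep minimum at index 1 lies before the
# entropy midpoint and is therefore ignored.
toy_entropies = [0.0, 0.5, 1.0, 1.5, 2.0, 2.2, 2.3]
toy_avg_vi = [0.10, 0.02, 0.30, 0.20, 0.05, 0.15, 0.04]
selected = select_robust_index(toy_entropies, toy_avg_vi, n_items=10)
```

Restricting the search to the region past the entropy midpoint avoids picking trivially stable partitions in which almost all sequences still sit in one large cluster.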
Figure 2. Identification of robust TribeMCL mouse HD classifications based on sequence information. Heat plot displays the average VI distance of each TribeMCL partition to its eight immediate-neighbor partitions for the mH178 (top) and H559 (bottom) datasets. The horizontal values are the negative powers of 10 used as the BLAST E-value thresholds. Vertical values are for the MCL inflation parameter I. The immediate-neighbor partitions are determined by one-step perturbation of the –log E-value (±1, horizontal axis) and/or I (±0.4, vertical axis) parameters. The distance between partitions is computed using the Variation of Information (VI) metric, and follows the depicted color scale. Robust regions are shown within black rectangles, with the most robust solutions indicated by arrows (see ‘Materials and Methods’ section).
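The eight-neighbor averaging behind such a heat plot could be computed along the following lines. This is a sketch under stated assumptions: `neighbor_average` is a hypothetical helper, the partitions are assumed to be stored in a dictionary keyed by integer grid steps of the two parameters, the distance function is supplied by the caller, and cells on the border of the grid are averaged over the neighbors that exist (the caption does not say how borders are handled).

```python
def neighbor_average(grid, distance):
    """Average distance of each grid cell to its immediate neighbors.

    grid:     dict mapping (row, col) grid steps to a partition object.
    distance: callable taking two partitions and returning a number.
    """
    # The eight one-step perturbations of the two parameters.
    offsets = [
        (dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
    ]
    averages = {}
    for (r, c), partition in grid.items():
        neighbors = [
            grid[(r + dr, c + dc)] for dr, dc in offsets if (r + dr, c + dc) in grid
        ]
        if neighbors:  # isolated cells have no defined average
            total = sum(distance(partition, other) for other in neighbors)
            averages[(r, c)] = total / len(neighbors)
    return averages


# Toy 3 x 3 grid in which integers stand in for partitions and the absolute
# difference stands in for the VI distance.
toy_grid = {(r, c): 3 * r + c for r in range(3) for c in range(3)}
toy_averages = neighbor_average(toy_grid, lambda a, b: abs(a - b))
```

Low values in the resulting map mark parameter settings whose partition barely changes under small perturbations, which is the sense of "robust" used in the caption.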
Summary of the mouse HD classifications derived on the basis of sequence information
The listed classifications were derived with five different methods, using either default or optimized settings. SECATOR was applied using default settings. Highlighted in gray are the coarser grained classifications, which display poor overlap with HD groups derived on the basis of the DNA-binding data.
SCI-PHY was applied using the Default and Entropy modes, respectively.
For the remaining methods, thresholds and parameter values were objectively defined as those yielding classifications that are robust against small changes in these values (see ‘Materials and Methods’ section).
The Objective Single Linkage (OSL) procedure was applied to three different sequence-similarity graphs built using the pairwise sequence identity (PID), the BLASTP log E-value (E-value) and the Total Relative Entropy (TRE) as sequence-similarity measures.
‘MCLb-pref’ is the classification derived using the MCL clustering algorithm applied to the graph built from the pairwise Pearson Correlation (PC) of the measured DNA-binding profiles (E-scores; see ‘Materials and Methods’ section).
‘Berger’ is the manually adjusted classification of Berger et al. (24).
All the sequence-based classifications were applied to the mH178 and H559 data sets (column 3). The ‘clustering summary’ (column 4) lists the total number of subfamilies, the number of subfamilies with at least two members and the number of HDs in the largest subfamily.
Columns 5 and 6 list the Variation of Information (VI) distance of the sequence-based classifications to the HD subtypes derived here from the DNA-binding data (MCLb-pref) and to the manually adjusted Berger classification, respectively.
The last two columns list the purity scores of the subfamilies relative to the same two classifications. In performing the Purity score calculations, the MCLb-pref and Berger et al. classifications were used as ‘reference’ partitions.
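The Objective Single Linkage procedure named in the notes above groups sequences by pruning a similarity graph at a threshold. A minimal sketch of that step, assuming standard single linkage (connected components of the pruned graph) and a similarity measure where larger means more alike, is given below; the function name and the edge-list input format are illustrative choices, and the objective choice of the threshold itself is not covered.

```python
def single_linkage_partition(n_items, weighted_edges, threshold):
    """Cluster items 0..n_items-1 into connected components of the pruned graph.

    weighted_edges: iterable of (i, j, similarity) tuples.
    Edges with similarity below the threshold are discarded.
    """
    parent = list(range(n_items))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j, similarity in weighted_edges:
        if similarity >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


# Toy percent-identity graph on five sequences.
toy_edges = [(0, 1, 90.0), (1, 2, 82.0), (2, 3, 60.0), (3, 4, 85.0)]
toy_clusters = single_linkage_partition(5, toy_edges, threshold=81.0)
```

For a dissimilarity measure such as TRE, the comparison would be reversed so that edges above the threshold are the ones discarded.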
Figure 3. Identification of robust MCL partitions for the mH178 data set based on the in vitro DNA-binding profiles. The heat plot displays the average distance of each MCL partition of the mH178 data set to its eight immediate-neighbor partitions, each obtained with one-step perturbation of the γ (±2, horizontal axis) and/or I (±0.4, vertical axis) parameters (see ‘Results’ section for details). The distance between partitions is computed using the Variation of Information (VI) metric, and follows the depicted color scale. Robust partitions (
Figure 4. Information on binding preferences mapped onto the sequence-based subfamilies derived using the Default- and Entropy- SCI-PHY modes. Scatter plots of the PWM overlap score, which quantifies the similarity between the DNA-binding sites of two homeodomains (see ‘Materials and Methods’ section) (vertical axis), against the pairwise Pearson Correlation (PC) of the 8-mer DNA-binding scores (E-scores) from (24) (horizontal axis). Both quantities are computed for individual HD pairs belonging to the same subfamily, and points representing pairs from different subfamilies are colored differently. (A) Scatter plots for pairs within the 55 subfamilies of the binding-data based MCLb-pref classification. (B) Scatter plots for pairs within the subfamilies derived by Berger et al. including all 178 mouse homeodomains (Supplementary Table S2). (C) Scatter plots for pairs within the 84 subfamilies of the OSLPID classification. (D) Scatter plots for pairs within the 33 mouse HD subfamilies of the Default-H559 classification.
Pairwise comparisons of different mouse HD classifications: VI distances
| | SCI-PHYEntropy mH178 | SCI-PHYEntropy H559 | CD-HIT | OSLPID | OSLTRE mH178 | OSLTRE H559 | OSLEval mH178 | OSLEval H559 |
|---|---|---|---|---|---|---|---|---|
| SCI-PHYEntropy mH178 | 0.00 | | | | | | | |
| SCI-PHYEntropy H559 | | 0.00 | 0.13 | 0.14 | 0.11 | 0.14 | 0.18 | 0.19 |
| CD-HIT | | | 0.00 | 0.09 | 0.10 | 0.11 | 0.15 | 0.16 |
| OSLPID | | | | 0.00 | 0.10 | 0.05 | 0.10 | 0.10 |
| OSLTRE mH178 | | | | | 0.00 | 0.15 | 0.09 | 0.10 |
| OSLTRE H559 | | | | | | 0.00 | 0.14 | 0.15 |
| OSLEval mH178 | | | | | | | 0.00 | 0.01 |
| OSLEval H559 | | | | | | | | 0.00 |
Summary of the pairwise similarity levels, in terms of Variation-of-Information (VI) distances, between the seven different fine-grained sequence-based classifications.
These classifications were derived from the mH178 or H559 data set, as indicated (see ‘Materials and Methods’ section for details), using three different methods: SCI-PHY in the Entropy mode, CD-HIT and the three Objective Single Linkage (OSL) procedures. OSLPID and CD-HIT each produced exactly the same classification when applied to either of the two HD data sets.
Relatively high VI values corresponding to poor similarity levels between the listed classifications are shown as underlined italic.
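For readers unfamiliar with the metric, the Variation of Information between two partitions of the same items can be computed as the sum of the two conditional entropies, VI = H(A|B) + H(B|A). The sketch below assumes each partition is given as a list of cluster labels, one per item; the optional division by ln N, which maps the distance onto [0, 1], is an assumption made here for illustration, since this excerpt does not state how the tabulated values are scaled.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b, normalize=False):
    """VI distance (in nats) between two partitions given as label sequences."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both partitions must label the same items")
    count_a = Counter(labels_a)                  # cluster sizes in A
    count_b = Counter(labels_b)                  # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))  # sizes of the intersections
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # Contribution of this intersection to H(B|A) + H(A|B).
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    if normalize and n > 1:
        vi /= math.log(n)  # assumed scaling by the upper bound ln N
    return vi


# Two partitions of four items that cut across each other completely.
toy_vi = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The distance is zero exactly when the two partitions are identical up to a renaming of the clusters, and it does not depend on the labels used.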
Pairwise comparisons of different mouse HD classifications: purity scores
| | SCI-PHYEntropy mH178 | SCI-PHYEntropy H559 | CD-HIT | OSLPID | OSLTRE mH178 | OSLTRE H559 | OSLEval mH178 | OSLEval H559 |
|---|---|---|---|---|---|---|---|---|
| SCI-PHYEntropy mH178 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SCI-PHYEntropy H559 | | 1.00 | 0.91 | 0.96 | 0.98 | 0.93 | 0.98 | 0.98 |
| CD-HIT | | 0.92 | 1.00 | 0.96 | 0.98 | 0.91 | 0.98 | 0.98 |
| OSLPID | | 0.94 | 0.96 | 1.00 | 1.00 | 0.93 | 1.00 | 1.00 |
| OSLTRE mH178 | | 0.85 | 0.85 | 0.89 | 1.00 | 0.82 | 0.95 | 0.95 |
| OSLTRE H559 | | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| OSLEval mH178 | | 0.79 | 0.80 | 0.84 | 0.91 | 0.77 | 1.00 | 1.00 |
| OSLEval H559 | | 0.79 | 0.80 | 0.82 | 0.89 | 0.77 | 0.98 | 1.00 |
Summary of the pairwise Purity scores (as defined in ‘Materials and Methods’ section) between the seven different fine-grained sequence-based classifications. Classifications listed along columns were considered as ‘target’ partitions, whereas those listed along rows were used as ‘reference’ partitions. See Table 2 for information on these classifications.
Low Purity values, underlined (column 2), indicate a poor correspondence between the target classifications in column 1 and the SCI-PHYEntropy mH178 classification used as reference.
In all other cases, the target classifications represent near perfect subdivisions of the reference partitions.
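The asymmetry between ‘target’ and ‘reference’ partitions can be illustrated with the commonly used clustering purity: the fraction of items that belong to the reference cluster best represented within their own target cluster. The paper's exact definition is given in its ‘Materials and Methods’ section, which is not reproduced here, so the sketch below is an assumption that may differ in detail.

```python
from collections import Counter


def purity(target_labels, reference_labels):
    """Purity of a target partition relative to a reference partition."""
    n = len(target_labels)
    if n != len(reference_labels):
        raise ValueError("both partitions must label the same items")
    # Size of every (target cluster, reference cluster) intersection.
    overlap = Counter(zip(target_labels, reference_labels))
    best = {}
    for (target, _reference), count in overlap.items():
        # Keep the largest overlap found for each target cluster.
        best[target] = max(best.get(target, 0), count)
    return sum(best.values()) / n


# A fine partition that subdivides a coarse one, scored in both directions.
fine = [0, 0, 1, 1, 2, 2]
coarse = [0, 0, 0, 0, 1, 1]
fine_vs_coarse = purity(fine, coarse)
coarse_vs_fine = purity(coarse, fine)
```

Under this definition a target that is a strict subdivision of the reference scores 1.0, while the reverse comparison scores lower, which is why the score is not symmetric.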
Figure 5. Correspondence between the Entropy-SCI-PHY classification of the H559 data set and the robust MCLb-pref partition derived from the in vitro DNA-binding profiles. Shown is the correspondence in terms of membership in individual subfamilies, for the pair of partitions MCLb-pref (γ = 94, I = 3.2) and the Entropy-H559 classification. The 84 Entropy-H559 subfamilies are listed in rows, in order of decreasing size (1–48 for the 48 nonsingleton subfamilies), with the 36 singleton subfamilies grouped in the bottom row. Individual homeodomains are colored according to their membership in the 55 mouse clusters from the MCLb-pref clustering solution, which contains 33 nonsingleton clusters and 22 singletons. Different colors correspond to different clusters in the MCLb-pref classification. Singleton MCLb-pref clusters are colored in white. Out of the 48 nonsingleton SCI-PHY subfamilies, 35 are either exactly identical to an MCLb-pref cluster (12 subfamilies: #2, 7, 9, 12, 16, 19, 25, 26, 28, 37, 40, 42, 44), or are entirely contained within such a cluster (23 subfamilies: #5, 8, 10, 11, 13, 14, 15, 17, 18, 20, 22, 23, 27, 31, 32, 35, 36, 38, 39, 41, 45, 48), whereas the remaining 13 subfamilies group HDs from more than one MCLb-pref cluster (see text).