Thomas Madej, Anna R Panchenko, Jie Chen, Stephen H Bryant.
Abstract
BACKGROUND: To discover remote evolutionary relationships and functional similarities between proteins, biologists rely on comparative sequence analysis and, when structures are available, on structural alignments and various measures of structural similarity. The measures and scores most commonly used for this purpose include alignment length, percent sequence identity, superposition RMSD, and their combinations. More recently, we have introduced the "Homologous core structure overlap score" (HCS) and the "Loop Hausdorff Measure" (LHM). Along with these we also consider the "gapped structural alignment score" (GSAS), which was introduced earlier by other researchers.
Year: 2007 PMID: 17425794 PMCID: PMC1852803 DOI: 10.1186/1472-6807-7-23
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Sensitivity values estimated from curves. Sensitivity values estimated from the curves (Figure 1) at 1% and 5% error rates (fraction of false positives) are listed for different similarity measures: loop Hausdorff measure (LHM), HCS score (HCS), gapped structural alignment score (GSAS), percent aligned (%aln), percent identity (%id), root mean square deviation (RMSD), and two other structural similarity measures (SI and MI) from [3].
| 1% error rate | 0.36 | 0.24 | 0.26 | 0.14 | 0.23 | 0.17 | 0.15 | 0.07 |
| 5% error rate | 0.59 | 0.54 | 0.49 | 0.45 | 0.44 | 0.44 | 0.43 | 0.31 |
Figure 1. Sensitivity curves for the three best-performing measures. The fraction of correctly ranked homologous VAST neighbors (true positives, sensitivity) is plotted against the fraction of incorrectly ranked homologous VAST neighbors (error rate) for the similarity measures yielding the best performance: HCS (green), LHM (red), and GSAS (cyan). True and false positive (error) rate values at each cutoff of the similarity measures were averaged over the protein families of the overall test set (152 families).
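The averaging described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `higher_is_better` flag, the choice of cutoff grid, and the rule of reading off the best sensitivity among curve points that do not exceed the target error rate (no interpolation) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# One family = (similarity scores of its VAST neighbors, homolog flags).
# This representation is a hypothetical choice for the sketch.
Family = Tuple[Sequence[float], Sequence[bool]]


def rates_at_cutoff(scores: Sequence[float], labels: Sequence[bool],
                    cutoff: float, higher_is_better: bool = True
                    ) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) at one score cutoff."""
    n_pos = sum(1 for flag in labels if flag)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("a family needs both homologs and non-homologs")
    tp = fp = 0
    for score, is_homolog in zip(scores, labels):
        # For distance-like measures such as RMSD, lower scores are better.
        passes = score >= cutoff if higher_is_better else score <= cutoff
        if passes and is_homolog:
            tp += 1
        elif passes:
            fp += 1
    return tp / n_pos, fp / n_neg


def averaged_curve(families: Sequence[Family], cutoffs: Sequence[float],
                   higher_is_better: bool = True
                   ) -> List[Tuple[float, float]]:
    """Average the rates over families at each cutoff; returns (fpr, tpr) points."""
    curve = []
    for cutoff in cutoffs:
        rates = [rates_at_cutoff(s, l, cutoff, higher_is_better)
                 for s, l in families]
        mean_tpr = sum(r[0] for r in rates) / len(rates)
        mean_fpr = sum(r[1] for r in rates) / len(rates)
        curve.append((mean_fpr, mean_tpr))
    return curve


def sensitivity_at_error(curve: Sequence[Tuple[float, float]],
                         error_rate: float) -> float:
    """Best averaged sensitivity among points with error rate <= error_rate."""
    eligible = [tpr for fpr, tpr in curve if fpr <= error_rate]
    return max(eligible) if eligible else 0.0
```

Calling `sensitivity_at_error(curve, 0.01)` and `sensitivity_at_error(curve, 0.05)` on a curve built this way would give values comparable in kind to the tabulated sensitivities at the 1% and 5% error rates, subject to the assumptions above.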
Figure 2. Performance on families of differing degrees of difficulty. The bar plot shows the sensitivity at the 5% error rate for each bin of ranking difficulty. Ranking difficulty is estimated as the average percent identity between the query structure and the non-redundant set of true positive structures (homologous VAST neighbors) for each CDD family. Each percent identity bin contains at least five CDD families within the given range of ranking difficulty, and the sensitivity of a bin is the average of the sensitivities of its CDD families. The CDD families chosen here are those with at least 20 non-redundant VAST neighbors. There are 13 CDD families in the 0–10% bin; 52 in the 10–20% bin; 21 in the 20–30% bin; and 11 in the 30–100% bin (97 CDD families altogether).
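The binning described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the `FamilyResult` record and its field names are hypothetical, the bins are treated as half-open intervals (the source does not say which bin receives a family at exactly 10%, 20%, or 30%), and the requirement of at least five families per bin is not enforced.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence

# Bin edges in percent identity, taken from the caption.
BIN_EDGES = [0, 10, 20, 30, 100]


@dataclass
class FamilyResult:
    """Hypothetical per-family record used only for this sketch."""
    name: str
    mean_percent_identity: float   # ranking difficulty estimate
    n_neighbors: int               # non-redundant VAST neighbors
    sensitivity_at_5pct: float     # sensitivity at the 5% error rate


def bin_label(percent_identity: float) -> str:
    """Map a percent identity to its bin label, e.g. '10-20%'."""
    for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
        if lo <= percent_identity < hi:
            return f"{lo}-{hi}%"
    if percent_identity == BIN_EDGES[-1]:
        # Assumption: exactly 100% identity belongs to the last bin.
        return f"{BIN_EDGES[-2]}-{BIN_EDGES[-1]}%"
    raise ValueError(f"percent identity out of range: {percent_identity}")


def sensitivity_by_difficulty(families: Sequence[FamilyResult],
                              min_neighbors: int = 20) -> Dict[str, float]:
    """Average sensitivity per difficulty bin over sufficiently large families."""
    binned: Dict[str, List[float]] = {}
    for fam in families:
        if fam.n_neighbors < min_neighbors:
            continue  # the caption keeps families with at least 20 neighbors
        label = bin_label(fam.mean_percent_identity)
        binned.setdefault(label, []).append(fam.sensitivity_at_5pct)
    return {label: mean(values) for label, values in binned.items()}
```

Lower percent identity bins correspond to harder ranking problems, so the returned mapping can be read directly as the bar heights of such a plot.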
Difficult families. Difficult families for all of the measures. For these seven CDD families, all six measures had sensitivities under 0.50 at the 5% error rate.
| CDD accession | Family name | Fold |
| cd00945 | Aldolase_Class_I | TIM β/α barrel |
| cd00529 | RuvC_resolvase | Ribonuclease H-like motif |
| cd01120 | RecA-like_NTPases | P-loop containing nucleoside triphosphate hydrolases |
| cd00079 | HELICc | P-loop containing nucleoside triphosphate hydrolases |
| cd00453 | FTBP_aldolase_II | TIM β/α barrel |
| cd00102 | IPT | Immunoglobulin-like β-sandwich |
| cd00234 | RPA14 | OB-fold |