Bong-Hyun Kim, Hua Cheng, Nick V Grishin.
Abstract
The biological properties of proteins are often gleaned through comparative analysis of evolutionary relatives. Although protein structure similarity search methods detect more distant homologs than purely sequence-based methods, structural resemblance can result from either homology (common ancestry) or analogy (similarity without common ancestry). While many existing web servers detect structural neighbors, they do not explicitly address the question of homology versus analogy. Here, we present a web server named HorA (Homology or Analogy) that identifies likely homologs for a query protein structure. Unlike other servers, HorA combines sequence information from state-of-the-art profile methods with structure information from spatial similarity measures using an advanced computational technique. HorA aims to identify biologically meaningful connections rather than purely 3D-geometric similarities. The HorA method finds approximately 90% of remote homologs defined in the manually curated database SCOP. HorA will be especially useful for finding remote homologs that might be overlooked by other sequence or structural similarity search servers. The HorA server is available at http://prodata.swmed.edu/horaserver.
Year: 2009 PMID: 19417074 PMCID: PMC2703895 DOI: 10.1093/nar/gkp328
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of the HorA server ‘accurate’ procedure.
Figure 2. Result pages from the HorA server. (a) Result page of a database search. (b) Result page of a pairwise comparison.
Performance on different data sets
| Data set | MH | MA | FA | SF | FD | CL | RT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total number of pairs | 241 | 130 | 25 792 | 67 283 | 121 805 | 5 293 101 | 20 602 882 |
| Accuracy (%) | 96.3 | 90.8 | 98.2 | 92.0 | 27.4 | 89.0 | 99.7 |
SCOP 1.69 domains with less than 40% sequence identity, obtained from ASTRAL (21), are paired in an all-on-all fashion. These pairs are divided into five subsets: FA (two domains are in the same SCOP family), SF (two domains are from different families but the same superfamily), FD (two domains are from different superfamilies but the same fold), CL (from different folds but the same class) and RT (from different classes). Manual homologs (MH) (17) and manual analogs (MA) (18) are manually prepared data sets. Domain pairs in MH, FA and SF are labeled as ‘homologs’, while pairs in MA, FD, CL and RT are labeled as ‘non-homologs’. Therefore, in calculating accuracies, classifying an MH, FA or SF pair as homologous is regarded as a ‘correct’ classification, while classifying an MA, FD, CL or RT pair as homologous is regarded as a ‘wrong’ classification. The accuracy equals the number of ‘correct’ classifications divided by the total number of pairs in that data set. 3000 SF and 3000 FD pairs were used in training the SVM model (see ‘Methods’ section, SVM model).
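The subset assignment and accuracy calculation described in this footnote can be sketched in a few lines of Python. This is only an illustration, not the HorA implementation: the function names `scop_subset` and `accuracy_by_subset` are hypothetical, the homolog/non-homolog predictions are assumed to come from some external classifier, and the sketch assumes each domain is identified by a SCOP sccs string of the form class.fold.superfamily.family.

```python
from collections import Counter

# Subsets whose pairs the footnote labels as 'homologs'; all others are 'non-homologs'.
HOMOLOG_SETS = {"MH", "FA", "SF"}


def scop_subset(sccs_a: str, sccs_b: str) -> str:
    """Assign a domain pair to FA, SF, FD, CL or RT from two SCOP sccs strings.

    An sccs string is assumed to read class.fold.superfamily.family.
    """
    a, b = sccs_a.split("."), sccs_b.split(".")
    # Count how many leading classification levels the two domains share.
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    # 0 shared levels: different classes (RT) ... 4 shared levels: same family (FA).
    return ["RT", "CL", "FD", "SF", "FA"][min(shared, 4)]


def accuracy_by_subset(pairs):
    """Compute per-subset accuracy in percent.

    pairs: iterable of (subset, predicted_homolog) tuples, where
    predicted_homolog is True if the classifier called the pair homologous.
    """
    total, correct = Counter(), Counter()
    for subset, predicted_homolog in pairs:
        total[subset] += 1
        is_homolog = subset in HOMOLOG_SETS
        # A call is 'correct' when it agrees with the subset's label.
        if predicted_homolog == is_homolog:
            correct[subset] += 1
    return {s: 100.0 * correct[s] / total[s] for s in total}
```

For example, `scop_subset("a.39.1.5", "a.39.1.7")` places a pair in SF because the two illustrative sccs strings share class, fold and superfamily but differ in family. Passing a list such as `[("SF", True), ("FD", True)]` to `accuracy_by_subset` counts the first call as correct and the second as wrong.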
Figure 3. Comparison of the top DALI hit and the top HorA hit for an EF-hand query. (a) Left: query domain (PDB 1eg3, A124–A209). Right: first DALI hit (PDB 1ls1, A1–A88). According to the DALI alignment between these two domains, structurally equivalent parts are represented in the same color, while unaligned parts are in gray. Coloring starts from blue (N-terminus) and ends in red (C-terminus). (b) Left: the same query domain as in (a). Right: first HorA hit (PDB 1uhn, A118–A197). Colored as in (a).