Antje Krause, Jens Stoye, Martin Vingron.
Abstract
BACKGROUND: Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. Despite the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.
Year: 2005 PMID: 15663796 PMCID: PMC547898 DOI: 10.1186/1471-2105-6-15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of SYSTERS, TribeMCL and single linkage clustering (SLC) on Pfam and ENZYME data sets (J: Jaccard Coefficient, SE: Sensitivity, SP: Specificity). The best result in each row is shown in boldface. For the single linkage clustering, only the results of the "best" clustering are shown, together with the corresponding cutoff E-value. In the case of Sensitivity/Specificity, these values were chosen according to the intersection point of the two curves obtained when plotting the values for all possible E-value cutoffs. All clustering procedures were applied to the non-redundant data set, and redundant sequences were added to the cluster sets again for comparison with the "true" cluster sets: 33,963,365 pairwise values of 283,113 non-redundant sequences used for clustering and 442,872 redundant sequences used in comparison; 1,582,948 pairwise values of 38,176 non-redundant sequences used for clustering and 84,405 redundant sequences used in comparison.
| | SLC | | SYSTERS | | TribeMCL at Inflation | | | | |
| | best | at cutoff | Superfam. | Subclust. | 1.1 | 2 | 3 | 4 | 5 |
| Pfam | |||||||||
| J | 0.19362 | 1e-53 | 0.15637 | --- | --- | --- | --- | --- | |
| SE | 0.26886 | 1e-49 | 0.48302 | --- | --- | --- | --- | --- | |
| SP | 0.26536 | 1e-49 | 0.17902 | --- | --- | --- | --- | --- | |
| ENZYME | |||||||||
| A.B.C.D | |||||||||
| J | 0.88760 | 1e-21 | 0.77445 | 0.60390 | 0.60074 | 0.59990 | 0.59942 | 0.59778 | |
| SE | 0.92295 | 1e-08 | 0.92297 | 0.61323 | 0.60328 | 0.60224 | 0.60164 | 0.59989 | |
| SP | 0.93616 | 1e-08 | 0.82294 | 0.96924 | 0.97543 | 0.99304 | 0.99357 | 0.99388 | |
| A.B.C.? | |||||||||
| J | 0.71527 | 1e-15 | 0.65915 | 0.48721 | 0.47900 | 0.47803 | 0.47746 | 0.47600 | |
| SE | 0.74985 | 1e-03 | 0.73727 | 0.49099 | 0.47996 | 0.47895 | 0.47836 | 0.47688 | |
| SP | 0.80855 | 1e-03 | 0.84073 | 0.97592 | 0.98445 | 0.99586 | 0.99601 | 0.99608 | |
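The Jaccard coefficient reported in the table compares a computed clustering with a reference ("true") clustering. The exact definitions used for J, SE and SP are not given in this excerpt, so the following is only a minimal sketch of one common pair-counting formulation; the function names and the toy sequence IDs are hypothetical.

```python
from itertools import combinations


def pair_counts(predicted, reference):
    """Count sequence pairs by co-membership in two clusterings.

    Both arguments map sequence ID -> cluster label. Only IDs present in
    both mappings are compared. Assumption: pair counting is used; the
    paper may define its measures differently.
    """
    ids = sorted(set(predicted) & set(reference))
    tp = fp = fn = 0
    for a, b in combinations(ids, 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1  # pair joined in both clusterings
        elif same_pred:
            fp += 1  # pair joined only in the computed clustering
        elif same_ref:
            fn += 1  # pair joined only in the reference clustering
    return tp, fp, fn


def jaccard(tp, fp, fn):
    """Jaccard coefficient over co-clustered pairs."""
    total = tp + fp + fn
    return tp / total if total else 1.0


# Toy data (made up for illustration, not taken from Pfam or ENZYME).
predicted = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
reference = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
tp, fp, fn = pair_counts(predicted, reference)
print(tp, fp, fn, jaccard(tp, fp, fn))
```

Sensitivity and specificity can be derived from the same three counts once their definitions are fixed.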
Figure 1. Multi-domain proteins. Sequences with different domain compositions belong to the same family of adenylate cyclases but form different "true" clusters (Pfam domains: RA: Ras association (RalGDS/AF-6) domain; LRR: Leucine Rich Repeats; PP2C: Protein phosphatase 2C; guanylate_cyc: Adenylate and Guanylate cyclase catalytic domain).
Figure 2. Schematic overview of the clustering procedures. We start with a single linkage tree constructed from pairwise distances. Each leaf in the tree corresponds to a protein sequence. Superfamilies are determined based on the internal structure of the tree. For each superfamily a distinct superfamily distance graph is built. This weighted graph is cut at weak connections into subclusters.
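Cutting a single linkage tree at a fixed E-value cutoff yields the connected components of the graph that links every pair of sequences whose E-value does not exceed the cutoff. The sketch below illustrates that equivalence with a small union-find structure; it is not the SYSTERS implementation, and the function name, sequence IDs and E-values are hypothetical.

```python
def single_linkage_clusters(ids, pairs, cutoff):
    """Return single linkage clusters at a given E-value cutoff.

    ids:    iterable of sequence identifiers
    pairs:  iterable of (id_a, id_b, evalue) tuples
    cutoff: pairs with evalue <= cutoff are linked
    """
    parent = {s: s for s in ids}

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in pairs:
        if evalue <= cutoff:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for s in ids:
        clusters.setdefault(find(s), set()).add(s)
    # Sort for a deterministic result.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy input (made-up E-values).
ids = ["a", "b", "c", "d"]
pairs = [("a", "b", 1e-50), ("b", "c", 1e-20), ("c", "d", 1e-3)]
print(single_linkage_clusters(ids, pairs, 1e-10))
```

Lowering the cutoff (for example to 1e-30) splits the clusters further, which is how a series of cutoffs traces out the levels of the tree.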
Figure 3. The SYSTERS algorithms.
Figure 4. Excerpt from the single linkage tree. The superfamily of sequence O93431 is determined as follows (traversing the tree along the branches depicted as bold lines). The first internal node connects this sequence with the four sequences P52794, P20827, P52793, and P97553 at an E-value of 1e-52. Thus, the ratio of the size of the merging subtree to the size of the current subtree at this point is 4/1. Stepping up the hierarchy, the next node (E-value 4e-38) connects these five sequences with a subtree consisting of 13 sequences, resulting in a ratio of 13/5 (= 2.6). Stepping further up the hierarchy, the following ratios are 1/18 (= 0.056 at E-value 6e-38), 2/19 (= 0.105 at E-value 2e-37), 15/21 (= 0.714 at E-value 2e-13), 1/36 (= 0.028 at E-value 5e-10), 211,975/37 (= 5729.054 at E-value 0.022), 259/212,012 (= 0.001 at E-value 0.023), etc. Taking the maximum of the ratios, we find the superfamily root at E-value 5e-10 as the last node before the largest relative increase (depicted as a bullet in the tree). The superfamily of sequence O93431 hence consists of the 37 sequences belonging to the ephrin type A and type B families plus a few predicted proteins.
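The ratio walk described in this caption can be reproduced in a few lines. The sketch below uses only the merge sizes and E-values listed above (the steps hidden behind "etc." are omitted, assuming none of them exceeds the maximum found here); the function name and data layout are hypothetical.

```python
def superfamily_root(merges):
    """Find the last node before the largest relative size increase.

    merges: list of (merging_subtree_size, evalue) pairs encountered on
            the path from a leaf towards the root of the tree.
    Returns (max_ratio, subtree_size_at_root, evalue_of_root_node).
    """
    current_size = 1     # start at the leaf (a single sequence)
    node_evalue = None   # the leaf itself has no E-value
    best = None
    for merging_size, evalue in merges:
        ratio = merging_size / current_size
        if best is None or ratio > best[0]:
            # The candidate root is the node reached *before* this merge.
            best = (ratio, current_size, node_evalue)
        current_size += merging_size
        node_evalue = evalue
    return best


# Path for sequence O93431, as listed in the caption.
PATH = [
    (4, 1e-52), (13, 4e-38), (1, 6e-38), (2, 2e-37),
    (15, 2e-13), (1, 5e-10), (211975, 0.022), (259, 0.023),
]
ratio, size, evalue = superfamily_root(PATH)
print(f"root at E-value {evalue}, {size} sequences, ratio {ratio:.3f}")
```

Tracking the node reached before each merge, rather than the merge itself, is what makes the function return the subtree of 37 sequences at E-value 5e-10 instead of the much larger subtree created at E-value 0.022.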
Figure 5. The superfamily distance graph of the ephrin superfamily. The graph contains only those edges whose E-values are at least as significant as (that is, no larger than) the superfamily cutoff of 5e-10. The width of an edge reflects its E-value, here ranging from 5e-10 (thinnest edge) to 3e-149 (thickest edge). The subclustering procedure first splits off nodes from the bottom right of the graph as single sequence clusters. These sequences are predicted proteins that have not yet been confirmed by any experiment. The last accepted split in the graph partitions it into the two major groups of ephrin type A (left) and type B (right) sequences, as shown by the dashed line. Single sequence clusters are added to the ephrin type B family in the subsequent singleton adoption step.
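Building such a graph amounts to keeping the pairs within the superfamily whose E-value does not exceed the cutoff. The following sketch shows that filtering step only and leaves out the subclustering (the cutting at weak connections). Using -log10(E-value) as the edge weight is an assumption made here so that more significant hits receive larger weights, in line with the edge widths in the figure; the sequence IDs and E-values are toy values.

```python
import math


def superfamily_graph(pairs, cutoff):
    """Build an undirected weighted graph as a dict of dicts.

    pairs:  iterable of (id_a, id_b, evalue) tuples within one superfamily
    cutoff: superfamily E-value cutoff; edges with evalue > cutoff are dropped
    Note: an E-value of exactly 0 would need special handling before log10.
    """
    graph = {}
    for a, b, evalue in pairs:
        if evalue <= cutoff:
            weight = -math.log10(evalue)  # assumed weighting, see lead-in
            graph.setdefault(a, {})[b] = weight
            graph.setdefault(b, {})[a] = weight
    return graph


# Toy input spanning the range quoted in the caption (IDs are made up).
pairs = [("x", "y", 3e-149), ("y", "z", 5e-10), ("x", "z", 1e-4)]
graph = superfamily_graph(pairs, cutoff=5e-10)
print(graph)
```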