James A Casbon, Mansoor A S Saqi.
Abstract
BACKGROUND: Annotation of sequences that share little similarity to sequences of known function remains a major obstacle in genome annotation. Some of the best methods of detecting remote relationships between protein sequences are based on matching sequence profiles. We analyse the superfamily-specific performance of sequence profile-profile matching. Our benchmark consists of a set of 16 protein superfamilies that are highly diverse at the sequence level. We relate the performance to the number of sequences in the profiles, the profile diversity and the extent of structural conservation in the superfamily.
Year: 2004 PMID: 15603591 PMCID: PMC543460 DOI: 10.1186/1471-2105-5-200
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Profile width and Neff for dataset
| Superfamily | Profile width (full) | Profile width (trimmed) | Neff (full) | Neff (trimmed) |
| (Trans)glycosidases | 410.4 | 23.93 | 13.11 | 3.21 |
| 4-helical cytokines | 85.71 | 43.57 | 4.3 | 2.86 |
| alpha/beta-Hydrolases | 509.43 | 22.32 | 16.32 | 3.65 |
| Cytochrome c | 413.62 | 18.86 | 12.64 | 3.7 |
| E Set domains | 182.73 | 33.27 | 7.99 | 3.16 |
| FAD/NAD(P)-binding | 616.52 | 20.57 | 15.33 | 3.68 |
| Fibronectin type | 1661.67 | 24.83 | 11.44 | 3.55 |
| Homeodomain-like | 255.21 | 39.33 | 7 | 3.34 |
| Immunoglobulin | 1614.7 | 69.04 | 11.33 | 3.65 |
| NAD(P)-binding | 463.14 | 29.55 | 12.32 | 3.27 |
| Nucleic acid-binding | 224.09 | 23.57 | 8.21 | 3.11 |
| P-loop | 483.03 | 26.44 | 11.64 | 2.92 |
| S-adenosyl | 472.42 | 22.08 | 14.88 | 3.22 |
| Thioredoxin-like | 471.72 | 25.28 | 12.61 | 3.58 |
| Viral coat | 265.28 | 35.93 | 6.11 | 2.96 |
| Winged helix | 206.94 | 24.81 | 8.11 | 3.13 |
Figure 1. ROC10 values for each superfamily in the dataset for full and trimmed profiles.
Figure 2. Mean RMSD values for superfamilies in the dataset. Error bars show one standard deviation.
Figure 3. ROC10 of the trimmed profiles versus average pairwise RMSD. Error bars show one standard deviation.
Figure 4. ROC10 of the trimmed profiles versus conservation for superfamilies in our dataset.
Figure 5. Mean RMSD of the trimmed profiles versus conservation for superfamilies in our dataset.
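The figures report ROC10 scores, but this record does not spell out how they are computed. As a rough illustration only, the sketch below uses the commonly used ROC_n definition: for a list of hits ranked best-first, count the true positives ranked above each of the first n false positives, sum those counts, and divide by n times the total number of true positives. The function name `roc_n`, the boolean-list input, and the handling of lists with fewer than n false positives are assumptions made for this example, not details taken from the paper.

```python
def roc_n(ranked_labels, n=10):
    """ROC_n score for a hit list sorted best-first (illustrative sketch).

    ranked_labels: iterable of booleans, True for a true positive
    (e.g. a hit in the same superfamily), False for a false positive.
    """
    labels = list(ranked_labels)
    total_true = sum(labels)
    if total_true == 0:
        return 0.0  # no true positives to find; the score is defined as 0 here

    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            # true positives ranked above this false positive
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives are found,
    # the missing false positives are treated as ranked below every hit seen.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)
```

Under this definition a perfect ranking, with every true positive above the first false positive, scores 1.0, and a ranking with n false positives ahead of any true positive scores 0.0.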
Properties of the dataset
| Superfamily | sunid | Families | Domains | Length | RMSD |
| (Trans)glycosidases | 51445 | 9 | 30 | 385.2 | 2.64 |
| 4-helical cytokines | 47266 | 3 | 21 | 146.76 | 3.12 |
| alpha/beta-Hydrolases | 53474 | 22 | 29 | 302.28 | 1.87 |
| Cytochrome c | 46626 | 8 | 21 | 116.14 | 1.6 |
| E Set domains | 81296 | 17 | 33 | 120.21 | 2.49 |
| FAD/NAD(P)-binding | 51905 | 5 | 21 | 244.1 | 1.47 |
| Fibronectin type | 49265 | 1 | 24 | 103.42 | 1.52 |
| Homeodomain-like | 46689 | 10 | 24 | 72.17 | 2.43 |
| Immunoglobulin | 48726 | 4 | 47 | 103.23 | 1.63 |
| NAD(P)-binding | 51735 | 10 | 49 | 202.67 | 1.89 |
| Nucleic acid-binding | 50249 | 10 | 44 | 120.86 | 2.05 |
| P-loop | 52540 | 18 | 70 | 257.64 | 3.99 |
| S-adenosyl | 53335 | 20 | 24 | 255.92 | 1.92 |
| Thioredoxin-like | 52833 | 12 | 29 | 121.72 | 1.88 |
| Viral coat | 49611 | 4 | 29 | 271.07 | 3.43 |
| Winged helix | 46785 | 35 | 48 | 92.65 | 2.33 |