Abstract
BACKGROUND: Computational protein annotation methods occasionally introduce errors. False-positive (FP) errors are annotations that are mistakenly associated with a protein. Such false annotations may then spread into databases through similarity with other proteins. Generally, methods that minimize the chance of FPs do so at the cost of decreased sensitivity or low throughput. We present a novel protein-clustering method that enables automatic separation of FPs from true hits. The method quantifies the biological similarity between pairs of proteins by examining each protein's annotations, and then clusters sets of proteins that received similar annotations into biological groups.
Year: 2005 PMID: 15755318 PMCID: PMC555558 DOI: 10.1186/1471-2105-6-46
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Similarity score calculation
| Annotation | CD63_RABIT (a) | CD68_HUMAN (a) | Frequency (b) | -ln(freq) |
| --- | --- | --- | --- | --- |
| Antigen | 1 | 1 | 0.007130 | 4.9435114 |
| Lysosome | 1 | 1 | 0.001929 | 6.2506136 |
| Glycoprotein | 1 | 1 | 0.094727 | 2.3567562 |
| Transmembrane | 1 | 1 | 0.159770 | 1.8340200 |
| Alternative splicing | 0 | 1 | 0.029281 | - |
| Signal | 0 | 1 | 0.123850 | - |
| Repeat | 0 | 1 | 0.078968 | - |
| Serum albumin family | 1 | 0 | 0.000342 | - |
| CD9/CD37/CD63 antigen | 1 | 0 | 0.000666 | - |
| Lysosome-associated membrane glycoprotein (lamp)/CD68 | 0 | 1 | 0.000123 | - |
| Membrane | 1 | 1 | 0.210869 | 1.5565182 |
| Lysosome | 1 | 1 | 0.002043 | 6.1932038 |
| Vacuole | 1 | 1 | 0.002184 | 6.1267895 |
| Lytic vacuole | 1 | 1 | 0.002043 | 6.1932038 |
| Cell | 1 | 1 | 0.440206 | 0.8205125 |
| Integral membrane protein | 1 | 1 | 0.160874 | 1.8271338 |
| Cytoplasm | 1 | 1 | 0.186569 | 1.6789541 |
| Intracellular | 1 | 1 | 0.307578 | 1.1790266 |
The table shows the calculation of the similarity score between two SwissProt proteins: rabbit CD63 antigen (CD63_RABIT) and human Microsialin precursor (CD68_HUMAN). The similarity score is the sum of -ln(freq) over all annotations that are shared by both proteins. a: 1 or 0 indicates whether the given protein has or lacks the annotation, respectively. b: The frequency is the fraction of proteins in the database that have the annotation.
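The scoring rule in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `similarity_score` and the tuple layout are assumptions made here, and the frequencies are simply the table rows copied in.

```python
import math

# Rows copied from the table: (annotation, in CD63_RABIT, in CD68_HUMAN, frequency).
# A list is used rather than a dict because "Lysosome" appears twice.
ROWS = [
    ("Antigen", 1, 1, 0.007130),
    ("Lysosome", 1, 1, 0.001929),
    ("Glycoprotein", 1, 1, 0.094727),
    ("Transmembrane", 1, 1, 0.159770),
    ("Alternative splicing", 0, 1, 0.029281),
    ("Signal", 0, 1, 0.123850),
    ("Repeat", 0, 1, 0.078968),
    ("Serum albumin family", 1, 0, 0.000342),
    ("CD9/CD37/CD63 antigen", 1, 0, 0.000666),
    ("Lysosome-associated membrane glycoprotein (lamp)/CD68", 0, 1, 0.000123),
    ("Membrane", 1, 1, 0.210869),
    ("Lysosome", 1, 1, 0.002043),
    ("Vacuole", 1, 1, 0.002184),
    ("Lytic vacuole", 1, 1, 0.002043),
    ("Cell", 1, 1, 0.440206),
    ("Integral membrane protein", 1, 1, 0.160874),
    ("Cytoplasm", 1, 1, 0.186569),
    ("Intracellular", 1, 1, 0.307578),
]


def similarity_score(rows):
    """Sum -ln(freq) over the annotations that both proteins have."""
    return sum(-math.log(freq) for _, in_a, in_b, freq in rows if in_a and in_b)


score = similarity_score(ROWS)
print(round(score, 2))  # sum of the last table column for the rows listed here
```

Because rare annotations have small frequencies, they contribute large terms: the shared "Lysosome" annotation adds about 6.25 to the score, whereas the very common "Cell" adds less than 1.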
Figure 1. Biological clustering example. The figure shows a dendrogram describing the clustering of 37 proteins that matched the PROSITE "Serum Albumin Family" signature. The clustering advances from right to left along the axis, which shows the similarity score at each point of the process. The vertical axis lists the 16 clusters produced by the initial clustering stage, from which the process starts. These initial clusters are numbered 1–16, with the number of proteins in each given in parentheses. Clusters 1–3 contain 5 Vitamin D Binding proteins (true positives, TPs). Clusters 4–13 contain 24 Albumin proteins (TPs). Cluster 14 contains 3 Afamin proteins (TPs). Clusters 15 and 16 contain the 5 FPs. The colors indicate the correct separation of this set into TPs and FPs.
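The merging process that such a dendrogram depicts can be sketched as a greedy agglomerative loop. The sketch below rests on assumptions that are not stated in this section: a cluster is represented by the annotations common to all its members, and two clusters are scored by summing -ln(freq) over the annotations they share. The function names, protein names, and frequencies are hypothetical toy values, not data from the paper.

```python
import math
from itertools import combinations


def shared_score(shared, freq):
    """Sum of -ln(freq) over a set of shared annotations."""
    return sum(-math.log(freq[a]) for a in shared)


def agglomerate(proteins, freq):
    """Greedily merge the two most similar clusters until one remains.

    proteins maps a protein name to its set of annotations.
    Returns one (score, merged_members) entry per merge step.
    """
    # Each cluster is (member names, annotations common to all members).
    # ASSUMPTION: this cluster representation is chosen for illustration only.
    clusters = [({name}, set(annots)) for name, annots in proteins.items()]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest shared-annotation score.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: shared_score(clusters[p[0]][1] & clusters[p[1]][1], freq),
        )
        members = clusters[i][0] | clusters[j][0]
        common = clusters[i][1] & clusters[j][1]
        history.append((shared_score(common, freq), sorted(members)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, common))
    return history


# Hypothetical toy data, not taken from the paper.
FREQ = {"Albumin": 0.001, "Secreted": 0.1, "Membrane": 0.2, "Antigen": 0.007}
PROTEINS = {
    "ALB_A": {"Albumin", "Secreted"},
    "ALB_B": {"Albumin", "Secreted"},
    "FP_1": {"Membrane", "Antigen"},
    "FP_2": {"Membrane", "Antigen", "Secreted"},
}

history = agglomerate(PROTEINS, FREQ)
for step_score, members in history:
    print(f"{step_score:6.2f}  {members}")
```

In this toy run the two albumin-like proteins merge first, the other two proteins merge next, and the final merge has a score of 0 because the two groups share no annotation. Halting before that last step leaves the two groups separate.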
Figure 2. Similarity score plot. The figure shows the similarity score (solid line) plotted against the progression of the clustering process for one of the tested protein sets, comprising 606 proteins annotated as "Rhodopsin-like GPCR superfamily". The score decreases from left to right as clusters are merged, indicating decreasing biological similarity. The vertical dashed line marks the correct halting step, which coincides with a distinct knee in the graph, indicating a point of stability in the process.
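One common way to locate a knee in such a curve is to take the point farthest from the straight line joining the first and last points. The sketch below uses that generic heuristic purely as an illustration. This section does not say how the halting step was actually chosen, so the function `knee_index` and the score values are assumptions, not the authors' criterion or data.

```python
def knee_index(scores):
    """Index of the point farthest from the chord joining the first and last points.

    scores[i] is the similarity score recorded at merge step i.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three points to locate a knee")
    dx = n - 1
    dy = scores[-1] - scores[0]

    def distance(i):
        # Unnormalised point-to-line distance; the chord length is the same
        # for every point, so dividing by it would not change the argmax.
        return abs(dy * i - dx * (scores[i] - scores[0]))

    return max(range(n), key=distance)


# Hypothetical score curve: a slow decline followed by a sharp drop.
scores = [40, 39, 38, 37, 36, 12, 10, 8]
print(knee_index(scores))
```

For this toy curve the function returns index 4, the last step before the score collapses. A real curve may be noisier, in which case smoothing or a different criterion might be needed.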
Figure 3. Relative success and failure in group size categories. The figure shows the relative success and failure of the clustering method for different group sizes (the group size of an annotation is the number of proteins that have the annotation). All tested sets were grouped into 30 categories according to the number of proteins in them, from 0 to 1500 proteins (shown on the horizontal axis). For each category, the relative share of successes (purple) and failures (blue) is shown. Relative success decreases as group size increases.