Lisa E M McMillan, Andrew C R Martin.
Abstract
BACKGROUND: There is a frequent need to obtain sets of functionally equivalent homologous proteins (FEPs) from different species. While orthology usually implies functional equivalence, this is not always true; therefore datasets of orthologous proteins are not appropriate. The information relevant to extracting FEPs is contained in databanks such as UniProtKB/Swiss-Prot, and a manual analysis of these data allows FEPs to be extracted on a one-off basis. However, there has been no resource allowing the easy, automatic extraction of groups of FEPs, for example, all instances of protein C.
We have developed FOSTA, an automatically generated database of FEPs annotated as having the same function in UniProtKB/Swiss-Prot, which can be used for large-scale analysis. The method builds a candidate list of homologues and filters out functionally diverged proteins on the basis of functional annotations using a simple text mining approach.
Year: 2008 PMID: 18838004 PMCID: PMC2576269 DOI: 10.1186/1471-2105-9-418
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Zebrafish candidates for the FOSTA family of HXB7_HUMAN
| Protein | ID | Description |
| --- | --- | --- |
| HXB7_HUMAN | 100 | Homeobox protein Hox-B7; Hox-2C; HHO.C1 |
| HXB7A_DANRE | 54 | Homeobox protein Hox-B7a; Hox-B7 |
| HXA1A_DANRE | 63 | Homeobox protein Hox-A1a; Hox-A1 |
| HXA3A_DANRE | 68 | Homeobox protein Hox-A3a |
| HXA4A_DANRE | 65 | Homeobox protein Hox-A4a; Zf-26; Hoxx4 |
| HXA5A_DANRE | 75 | Homeobox protein Hox-A5a |
| HXA9B_DANRE | 62 | Homeobox protein Hox-A9b |
| HXB1A_DANRE | 64 | Homeobox protein Hox-B1a; Hox-B1 |
| HXB1B_DANRE | 64 | Homeobox protein Hox-B1b; Hox-A1 |
| HXB2A_DANRE | 57 | Homeobox protein Hox-B2a; Hox-B2 |
| HXB3A_DANRE | 67 | Homeobox protein Hox-B3a; Hox-B3 |
| HXB4A_DANRE | 62 | Homeobox protein Hox-B4a; Hox-B4; Zf-13 |
| HXB5A_DANRE | 75 | Homeobox protein Hox-B5a; Hox-B5; Zf-21 |
| HXB5B_DANRE | 75 | Homeobox protein Hox-B5b; Hox-B5-like; Zf-54 |
| HXB6A_DANRE | 78 | Homeobox protein Hox-B6a; Hox-B6; Zf-22 |
| HXB6B_DANRE | 75 | Homeobox protein Hox-B6b; Hox-A7 |
| HXB8B_DANRE | 60 | Homeobox protein Hox-B8b; Hox-A8 |
| HXC1A_DANRE | 62 | Homeobox protein Hox-C1a |
| HXC3A_DANRE | 61 | Homeobox protein Hox-C3a; Hox-114; Zf-114 |
| HXC5A_DANRE | 72 | Homeobox protein Hox-C5a; Hox-C5; Hox-3.4; Zf-25 |
| HXC6A_DANRE | 63 | Homeobox protein Hox-C6a; Hox-C6; Zf-61 |
| HXC6B_DANRE | 77 | Homeobox protein Hox-C6b |
| HXC8A_DANRE | 73 | Homeobox protein Hox-C8a |
| HXD4A_DANRE | 62 | Homeobox protein Hox-D4a; Hox-D4 |
| HXD9A_DANRE | 65 | Homeobox protein Hox-D9a; Hox-D9 |
| HXDAA_DANRE | 61 | Homeobox protein Hox-D10a; Hox-D10; Hox-C10 |
Protein: The UniProtKB/Swiss-Prot ID; ID: The sequence identity of the Protein to HXB7_HUMAN;
Description: The UniProtKB/Swiss-Prot description (DE) field.
Functional sites in HXB7_HUMAN
| Functional site | Location | Reference |
| --- | --- | --- |
| DNA binding (homeobox) | 137 – 197 | UniProtKB/Swiss-Prot FT/DNA_BIND annotation |
| Crosslink (glycyl lysine isopeptide) | 191 & 193 | UniProtKB/Swiss-Prot FT/CROSSLNK annotation |
| Motif (Antp-type hexapeptide) | 126 – 131 | UniProtKB/Swiss-Prot FT/MOTIF annotation |
| Hypothesized binding to PBX | 129 – 130 | Yaron |
| Putative CKII target | 132 – 133 | Yaron |
| Putative CKII target | 203 – 204 | Yaron |
Functional site: a description of the functional site; Location: the residue number in HXB7_HUMAN;
Reference: The source of the annotation.
Figure 1. Verifying the [...]. Residues identical to those of HXB7_HUMAN are in bold capitals and highlighted in yellow; mismatching residues are in lower case and highlighted in light grey. The root human protein (HXB7_HUMAN) is indicated by the red box, and the assigned zebrafish protein is highlighted by the blue box. The position relative to HXB7_HUMAN is given on the top line, and the asterisks on the bottom line mark fully conserved columns.
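The highlighting scheme described in the caption can be sketched in a few lines. This is a minimal illustration, not the code used to produce the figure: the function names are hypothetical, the sequences are toy strings rather than real Hox sequences, and gap-only columns are assumed never to count as conserved.

```python
def mark_against_root(root, other):
    """Upper-case residues that match the root sequence, lower-case the rest."""
    return "".join(
        o.upper() if o != "-" and o.upper() == r.upper() else o.lower()
        for r, o in zip(root, other)
    )


def conservation_line(aligned):
    """Return '*' under every fully conserved, gap-free column, ' ' elsewhere."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0].upper()
        conserved = first != "-" and all(c.upper() == first for c in column)
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Toy alignment (hypothetical sequences, for illustration only).
toy_root = "MKRAW"
toy_others = ["MKQAW", "MRRAW"]
toy_marked = [mark_against_root(toy_root, s) for s in toy_others]
toy_stars = conservation_line([toy_root] + toy_others)
```

With these toy inputs, `toy_marked` is `["MKqAW", "MrRAW"]` and `toy_stars` is `"*  **"`.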
Benchmarking FOSTA against the PIRSF dataset
| Families | Pairings | TP | FP | TN | FN | PPV (%) | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 122 | 2127 | 1744 | 2 | 3717 | 383 | 99.89 | 0.86 |
| 1095 | 18865 | 12967 | 23 | 34656 | 5898 | 99.82 | 0.77 |
| 474 | 11221 | 9146 | 62 | 11819 | 2075 | 99.33 | 0.83 |
| 339 | 5287 | 3674 | 16 | 4938 | 1613 | 99.57 | 0.72 |
| 1691 | 32213 | 23857 | 87 | 50192 | 8356 | 99.64 | 0.79 |
| 2020 | 37500 | 27531 | 103 | 55130 | 9969 | 99.63 | 0.79 |
Set ID: the identifier for each curation set [A='Full/Desc.', B='Full', C='Preliminary', D='None', N=aNnotated (A+B+C), * = All (N+D)]; Curation string: the string that defines the curation set; Families: the number of discrete protein families in the curation set; Pairings: the number of discrete pairings across all families to be tested in FOSTA; Basic statistics: the basic counts of true positives (TP), false positives (FP), true negatives (TN), false negatives (FN); Evaluation statistics: the PPV (positive predictive value, TP/(TP + FP)), and the MCC (Matthews Correlation Coefficient), all rounded to 2dp
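The two evaluation statistics named in the legend can be reproduced from the four basic counts. The sketch below uses the standard definitions of PPV and MCC; the function names are illustrative, and returning 0 for an undefined MCC is a common convention assumed here rather than something stated in the source.

```python
import math


def ppv(tp, fp):
    """Positive predictive value as a percentage: TP / (TP + FP)."""
    return 100.0 * tp / (tp + fp)


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a marginal total is zero
    return (tp * tn - fp * fn) / denominator


# Counts from the first row of the PIRSF benchmark table.
row_ppv = round(ppv(tp=1744, fp=2), 2)
row_mcc = round(mcc(tp=1744, fp=2, tn=3717, fn=383), 2)
```

For that row (TP = 1744, FP = 2, TN = 3717, FN = 383), the rounded results are 99.89 and 0.86, which agree with the tabulated values.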
Benchmarking FOSTA against the refined Hulsen et al. dataset
| Refined pairings | (TO pairings) | TP | FP | TN | FN | PPV (%) | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | (9) | 2 | 0 | 17 | 0 | 100.00 | 1.00 |
| 30 | (41) | 30 | 0 | 3853 | 0 | 100.00 | 1.00 |
| 12 | (17) | 12 | 0 | 22 | 0 | 100.00 | 1.00 |
| 6 | (6) | 6 | 0 | 5 | 0 | 100.00 | 1.00 |
| 4 | (29) | 1 | 1 | 327 | 3 | 50.00 | 0.35 |
| 54 | (102) | 51 | 1 | 4224 | 3 | 98.08 | 0.96 |
Protein family: the protein family being examined; TO pairings: the number of TO pairs in the Hulsen dataset (including many-to-many orthologous pairings and non-UniProtKB/Swiss-Prot proteins); Refined pairings: the number of one-to-one TO pairings tested after refinement of the Hulsen TO dataset; Basic statistics: the basic counts of true positives (TP), false positives (FP), true negatives (TN), false negatives (FN); Evaluation statistics: the PPV (positive predictive value, TP/(TP + FP)), and the MCC (Matthews Correlation Coefficient), all rounded to 2dp.
Comparing FOSTA with Inparanoid
| Code | Species | Pairs | Matches | Mismatches | % match | Overlooked | Rejected |
| --- | --- | --- | --- | --- | --- | --- | --- |
| APIME | | 1 | 1 | 0 | 100.00% | - | - |
| BOSTA | | 3508 | 3451 | 57 | 98.38% | 1 | 56 |
| CANFA | | 533 | 520 | 13 | 97.56% | 1 | 12 |
| CIOIN | | 6 | 5 | 1 | 83.33% | 0 | 1 |
| DANRE | | 1246 | 1192 | 54 | 95.67% | 21 | 33 |
| DICDI | | 85 | 69 | 16 | 81.18% | 0 | 16 |
| DROME | | 878 | 712 | 166 | 81.09% | 14 | 152 |
| DROPS | | 73 | 67 | 6 | 91.78% | 0 | 6 |
| GALGA | | 1360 | 1297 | 63 | 95.37% | 12 | 51 |
| GASAC | | 1 | 1 | 0 | 100.00% | - | - |
| MACMU | | 214 | 207 | 7 | 96.73% | 0 | 7 |
| MONDO | | 22 | 21 | 1 | 95.45% | 0 | 1 |
| MUSMU | | 12063 | 11960 | 103 | 99.15% | 18 | 85 |
| ORYSA | | 1 | 0 | 1 | 0.00% | 0 | 1 |
| PANTR | | 412 | 408 | 4 | 99.03% | 1 | 3 |
| RATNO | | 5076 | 5005 | 71 | 98.60% | 6 | 65 |
| SACCE | | 1213 | 787 | 426 | 64.88% | 49 | 377 |
| TETNI | | 6 | 6 | 0 | 100.00% | - | - |
| XENTR | | 371 | 364 | 7 | 98.11% | 2 | 5 |
| - | All species | 27069 | 26073 | 996 | 96.32% | 125 | 871 |
Code: The species code as used by Inparanoid; Species: The full species name; Pairs: The number of one-to-one orthologue pairs described by Inparanoid between Species and Human; Matches: The number of one-to-one Inparanoid orthologue pairs (IPs) that are also found by FOSTA; Mismatches: The number of IPs that are not found by FOSTA; % match: The percentage of IPs that are also found by FOSTA; Overlooked: The number of IPs where FOSTA assigns a different protein from the Species to the FOSTA family of the human protein; Rejected: The number of IPs where FOSTA does not assign any protein from the Species to the FOSTA family of the human protein.
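The categories in this legend amount to a three-way classification of each Inparanoid pair, which can be sketched as follows. The data structures are assumptions made for illustration: `inparanoid_pairs` is taken to be a list of (human protein, species protein) tuples and `fosta_assignment` a mapping from each human protein to the protein FOSTA assigns for that species (absent if none). The identifiers in the toy data are placeholders, not real entries.

```python
from collections import Counter


def compare_pairs(inparanoid_pairs, fosta_assignment):
    """Classify each Inparanoid pair as a match, overlooked or rejected."""
    counts = Counter(pairs=0, matches=0, overlooked=0, rejected=0)
    for human_id, ortholog_id in inparanoid_pairs:
        counts["pairs"] += 1
        assigned = fosta_assignment.get(human_id)
        if assigned is None:
            counts["rejected"] += 1      # FOSTA assigns no protein from the species
        elif assigned == ortholog_id:
            counts["matches"] += 1       # FOSTA agrees with Inparanoid
        else:
            counts["overlooked"] += 1    # FOSTA assigns a different protein
    counts["mismatches"] = counts["overlooked"] + counts["rejected"]
    return counts


def percent_match(matches, pairs):
    """Percentage of Inparanoid pairs that FOSTA also finds."""
    return 100.0 * matches / pairs


# Placeholder identifiers, for illustration only.
toy_pairs = [("P1_HUMAN", "P1_XXXXX"), ("P2_HUMAN", "P2_XXXXX"), ("P3_HUMAN", "P3_XXXXX")]
toy_fosta = {"P1_HUMAN": "P1_XXXXX", "P2_HUMAN": "P2B_XXXXX"}
toy_counts = compare_pairs(toy_pairs, toy_fosta)
```

As a check on the arithmetic, `percent_match(1192, 1246)` rounds to 95.67, the value given for DANRE, and the 54 DANRE mismatches are the sum of 21 overlooked and 33 rejected pairs.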
Example insensitivities in the FOSTA functional match methodology
| CC45L_HUMAN | CDC45-related protein; PORC-PI-1; Cdc45 |
| CDC45_YEAST | Cell division control protein 45 |
| FGF17_HUMAN | Fibroblast growth factor 17 precursor; FGF-17 |
| FG17B_DANRE | Fibroblast growth factor 17b precursor; FGF-17b |
Figure 2. The FOSTA filtering process: homologues are identified by BLAST-ing against the UniProtKB/Swiss-Prot database (filtering stage (1)); these are then filtered to retain only those with similar function (filtering stage (2)); finally one protein per species (the FEP, or functionally equivalent protein) is chosen using a hierarchy of functional matches to eliminate functionally diverged homologues (FDHs) (filtering stage (3)).
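The three filtering stages can be illustrated with a small sketch. It is not the FOSTA implementation: the caption does not specify the matching rules, so the two-level name comparison below (identical name, then one name's words contained in another's) is a stand-in hierarchy, and breaking ties by sequence identity is an assumption. The `Candidate` class and all function names are hypothetical. Stage (1) is represented only by its output, a list of BLAST homologues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    protein_id: str
    species: str
    identity: float      # sequence identity to the root protein
    description: str     # DE-style field, names separated by ';'


def name_tokens(description):
    """Split a description into names, each a frozenset of lower-case words."""
    return [frozenset(n.lower().split()) for n in description.split(";") if n.strip()]


def match_rank(root_desc, cand_desc):
    """Stand-in hierarchy: 0 = identical name, 1 = word containment, None = no match."""
    best = None
    for root_name in name_tokens(root_desc):
        for cand_name in name_tokens(cand_desc):
            if root_name == cand_name:
                return 0
            if root_name <= cand_name or cand_name <= root_name:
                best = 1
    return best


def select_feps(root_desc, candidates):
    """Stages (2) and (3): drop non-matching homologues, keep the best per species."""
    best = {}
    for cand in candidates:                      # stage (1) output
        rank = match_rank(root_desc, cand.description)
        if rank is None:
            continue                             # stage (2): no functional match
        key = (rank, -cand.identity)             # stage (3): best rank, then identity
        if cand.species not in best or key < best[cand.species][0]:
            best[cand.species] = (key, cand)
    return {species: cand for species, (_, cand) in best.items()}


# Three rows taken from the zebrafish candidate table for HXB7_HUMAN.
root_description = "Homeobox protein Hox-B7; Hox-2C; HHO.C1"
zebrafish = [
    Candidate("HXB7A_DANRE", "DANRE", 54, "Homeobox protein Hox-B7a; Hox-B7"),
    Candidate("HXB5A_DANRE", "DANRE", 75, "Homeobox protein Hox-B5a; Hox-B5; Zf-21"),
    Candidate("HXB6A_DANRE", "DANRE", 78, "Homeobox protein Hox-B6a; Hox-B6; Zf-22"),
]
feps = select_feps(root_description, zebrafish)
```

On these three rows the stand-in rule retains only HXB7A_DANRE, despite it having the lowest sequence identity of the three, which illustrates why the selection is driven by annotation rather than by similarity alone.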