Aleksandar Stojmirović, Yi-Kuo Yu.
Abstract
Robust advances in interactome analysis demand comprehensive, non-redundant and consistently annotated data sets. By non-redundant, we mean that the accounting of evidence for every interaction should be faithful: each independent experimental support is counted exactly once, no more, no less. While many interactions are shared among public repositories, none of them contains the complete known interactome for any model organism. In addition, the annotations of the same experimental result by different repositories often disagree. This raises the question of which annotation to keep when consolidating identical pieces of evidence. The iRefIndex database, which includes interactions from most popular repositories with a standardized protein nomenclature, represents a significant advance in all aspects, especially in comprehensiveness. However, iRefIndex aims to maintain all information/annotation from original sources and requires users to perform additional processing to fully achieve the aforementioned goals. Another issue has to do with protein complexes. Some databases represent experimentally observed complexes as interactions with more than two participants, while others expand them into binary interactions using the spoke or matrix model. To avoid a buildup of untested interaction information, it is preferable to replace the expanded protein complexes, from either the spoke or the matrix model, with a flat list of complex members. To address these issues and to achieve our goals, we have developed ppiTrim, a script that processes iRefIndex to produce non-redundant, consistently annotated data sets of physical interactions. Our script proceeds in three stages: mapping all interactants to gene identifiers and removing all undesired raw interactions, deflating potentially expanded complexes, and reconciling for each interaction the annotation labels among different source databases.
As an illustration, we have processed the three largest organismal data sets: yeast, human and fruitfly. While ppiTrim can resolve most apparent conflicts between different labelings, we also discovered some unresolvable disagreements, mostly resulting from different annotation policies among repositories. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ppiTrim.html
Year: 2011 PMID: 21873645 PMCID: PMC3162744 DOI: 10.1093/database/bar036
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. ppiTrim uses two procedures for complex deflation: pattern detection (top) and template matching (bottom). As an example, assume that a graph ABCDEFG, shown on the left, could be constructed from complex candidate interactions annotated by BioGRID from a single publication. The arrows indicate bait-to-prey relationships, with the interaction A–D reported twice, once with A and once with D as the bait. The pattern-detection algorithm (top) would recognize A and D as hubs of potentially spoke-expanded complexes and thus replace all pairwise interactions on the left with complexes ABCDEF and ACDEFG. Suppose that the complex ACDEF was reported from the same publication by a different database. Then the template-matching procedure (bottom) would generate the complex ACDEF (with all other annotations, such as experimental detection method, retained from the original interactions) and remove all original interactions except D–G and A–B. After performing both procedures, ppiTrim consolidates the results, so that the original interactions are replaced by complexes ACDEF, ABCDEF and ACDEFG with edge type codes 'R', 'A' and 'A', respectively. The interactions A–B and D–G would not be retained since they are contained within the deflated complexes ABCDEF and ACDEFG.
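The Figure 1 example can be traced in a few lines of Python. This is only a minimal sketch, not the ppiTrim implementation. The edge list is reconstructed from the caption's description. The function names, the `min_preys` threshold and the rule that a template matches when one of its members acts as bait for all the others are assumptions made for illustration.

```python
from collections import defaultdict

# Bait -> prey pairs reconstructed from the caption (A-D occurs in both directions).
EDGES = [("A", p) for p in "BCDEF"] + [("D", p) for p in "ACEFG"]


def preys_by_bait(edges):
    """Collect the set of preys reported for each bait."""
    preys = defaultdict(set)
    for bait, prey in edges:
        preys[bait].add(prey)
    return preys


def detect_spoke_complexes(edges, min_preys=2):
    """Pattern detection: every bait with enough preys yields one candidate complex.

    min_preys is a hypothetical threshold, not a documented ppiTrim parameter.
    """
    return {bait: frozenset(found | {bait})
            for bait, found in preys_by_bait(edges).items()
            if len(found) >= min_preys}


def match_template(edges, template):
    """Template matching against a complex reported by another database.

    Assumed rule: the template matches if some member, used as bait,
    reaches every other member. Matched edges are removed.
    """
    template = frozenset(template)
    preys = preys_by_bait(edges)
    matched = any(template - {bait} <= preys[bait]
                  for bait in template if bait in preys)
    if not matched:
        return False, list(edges)
    leftover = [(b, p) for b, p in edges
                if not (b in template and p in template)]
    return True, leftover


def consolidate(pattern_complexes, template, leftover):
    """Merge both results and drop leftover pairs contained in a deflated complex."""
    result = {frozenset(template): "R"}
    for members in pattern_complexes.values():
        result.setdefault(members, "A")
    kept = [edge for edge in leftover
            if not any(set(edge) <= members for members in result)]
    return result, kept


detected = detect_spoke_complexes(EDGES)
matched, leftover = match_template(EDGES, "ACDEF")
complexes, kept = consolidate(detected, "ACDEF", leftover)
for members, code in complexes.items():
    print("".join(sorted(members)), code)
```

With this input, template matching leaves only A–B and D–G, and consolidation then drops both because each lies inside a deflated complex, as the caption describes.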
Table 1. Edge type codes used by ppiTrim
| Code | Description |
|---|---|
| X | Undirected binary interaction (physical binding) |
| D | Directed binary interaction (biochemical reaction) |
| B | Biochemical reaction without indication of directionality |
| C | Original complex (from iRefIndex) |
| G | Spoke-expanded complex; deflated by pattern matching from BioGRID's 'Co-purification' and 'Co-fractionation' categories (reliable) |
| R | Potential spoke-expanded complex; deflated by template matching of a 'C'-complex |
| A | Potential spoke-expanded complex (BioGRID only); deflated by pattern detection |
| N | Potential spoke-expanded complex; deflated by template matching of a 'G'- or 'A'-complex |
Figure 2. Part of the PSI-MI ontology graph for interaction-detection methods associated with a hypothetical cluster of source interactions involving the same interactants from the same publication. The terms colored blue are associated with the source interactions within the cluster, while those marked yellow and green are present in the ontology but do not label any source interaction from the cluster. The entire cluster as shown is consistent, with the term MI:0401 as the maximal element. Its finest consistent term is MI:0004 (colored green), since the cluster members smaller than it are not comparable with one another. Removing the source interactions labeled by MI:0401 from the cluster would result in three distinct subclusters. If two subclusters contain no interaction from the same source database, they would be reported as conflicts.
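One way to read the caption is that the finest consistent term is the most specific ontology term that is comparable with every label in the cluster without being finer than all of them. The sketch below implements that reading on a toy fragment and should be taken as an interpretation, not as ppiTrim's code. Only MI:0401 and MI:0004 come from the caption. The `leaf_*` names, the single-parent structure and all function names are hypothetical.

```python
# Toy ontology fragment, child -> parent. Only MI:0401 and MI:0004 appear in
# the caption; the leaf_* terms are hypothetical placeholders.
PARENT = {
    "MI:0004": "MI:0401",
    "leaf_a": "MI:0004",
    "leaf_b": "MI:0004",
    "leaf_c": "MI:0004",
    "leaf_a1": "leaf_a",
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    chain = {term}
    while term in PARENT:
        term = PARENT[term]
        chain.add(term)
    return chain


def comparable(a, b):
    """Two terms are comparable when one is an ancestor of (or equal to) the other."""
    return a in ancestors(b) or b in ancestors(a)


def finest_consistent_term(cluster):
    """Deepest term comparable with every label and not finer than all labels."""
    all_terms = set(PARENT) | set(PARENT.values())
    candidates = [
        term for term in all_terms
        if all(comparable(term, label) for label in cluster)
        and any(term in ancestors(label) for label in cluster)
    ]
    return max(candidates, key=lambda term: len(ancestors(term)), default=None)


def subclusters(labels):
    """Group labels so that comparable labels end up in the same subcluster."""
    groups = []
    for label in sorted(labels):
        linked = [g for g in groups if any(comparable(label, m) for m in g)]
        merged = {label}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return groups


cluster = {"MI:0401", "leaf_a", "leaf_a1", "leaf_b", "leaf_c"}
print(finest_consistent_term(cluster))
print(subclusters(cluster - {"MI:0401"}))
```

On this toy cluster the finest consistent term is MI:0004, and removing MI:0401 leaves three subclusters, mirroring the situation in the figure.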
Table 2. Processing source interactions
| Species | Initial | Removed | Without Gene ID | Retained | With mapped Gene ID |
|---|---|---|---|---|---|
| Yeast | 400 449 | 173 815 | 3608 | 223 026 | 880 |
| Human | 382 094 | 148 724 | 2738 | 230 632 | 16 187 |
| Fruitfly | 154 770 | 32 477 | 9476 | 112 817 | 3427 |
Statistics of initial processing of raw interactions from iRefIndex. Shown are the initial number, total number removed due to filtering criteria, number removed due to missing Gene ID, total number of retained and the number retained containing at least one interactant with mapped Gene ID.
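The columns can be cross-checked with simple arithmetic: in every row, the retained count equals the initial count minus both removal counts. The short sketch below runs that check on the three table rows in order. It assumes the two removal columns are disjoint, which the numbers support, and the function name is illustrative only.

```python
# (initial, removed by filtering, removed for missing Gene ID, reported retained),
# one tuple per table row, in table order.
ROWS = [
    (400_449, 173_815, 3_608, 223_026),
    (382_094, 148_724, 2_738, 230_632),
    (154_770, 32_477, 9_476, 112_817),
]


def retained(initial, removed, without_gene_id):
    """Interactions left after both removal steps (assumed to be disjoint)."""
    return initial - removed - without_gene_id


computed = [retained(i, r, w) for i, r, w, _ in ROWS]
print(computed)  # [223026, 230632, 112817]
```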
Table 3. Mapping CROGIDs from iRefIndex into Gene IDs
| Species | Initial CROGIDs: total | Initial CROGIDs: mapped | Initial CROGIDs: orphans | Additionally mapped: total | Additionally mapped: valid | Final: CROGIDs | Final: Gene IDs |
|---|---|---|---|---|---|---|---|
| Yeast | 6159 | 5552 | 607 | 433 | 47 | 5599 | 5618 |
| Human | 14 047 | 11 432 | 2615 | 1261 | 1261 | 12 693 | 11 786 |
| Fruitfly | 9379 | 7810 | 1569 | 566 | 566 | 8346 | 7846 |
Statistics of mapping CROGIDs into Gene IDs. Columns 2–4 show the total number of CROGIDs considered, the number that could be directly mapped to Gene IDs and the number of 'orphans' that are not associated with a Gene ID in the iRefIndex file. Columns 5 and 6 show the number of CROGIDs additionally mapped to Gene IDs, while the last two columns show the final number of CROGIDs accepted and the corresponding number of Gene IDs. It is possible for a CROGID to map to multiple Gene IDs (if multiple genes encode the same protein sequence) as well as for multiple CROGIDs to map to a single Gene ID (if our additional mapping links them to the same gene).
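Because the mapping is many-to-many, the final counts of CROGIDs and Gene IDs need not agree. The sketch below shows one way such a mapping could be held in memory. The identifiers are invented for illustration and `build_maps` is a hypothetical helper, not part of ppiTrim.

```python
from collections import defaultdict

# Hypothetical (CROGID, Gene ID) pairs, invented for illustration only.
PAIRS = [
    ("crogid_1", 101),
    ("crogid_1", 102),  # one protein sequence encoded by two genes
    ("crogid_2", 103),
    ("crogid_3", 103),  # two CROGIDs linked to the same gene
]


def build_maps(pairs):
    """Build both directions of a many-to-many CROGID <-> Gene ID mapping."""
    to_genes = defaultdict(set)
    to_crogids = defaultdict(set)
    for crogid, gene_id in pairs:
        to_genes[crogid].add(gene_id)
        to_crogids[gene_id].add(crogid)
    return to_genes, to_crogids


to_genes, to_crogids = build_maps(PAIRS)
print(len(to_genes), "CROGIDs map to", len(to_crogids), "Gene IDs")
```

Keeping both directions makes it easy to spot CROGIDs that fan out to several genes and genes that collect several CROGIDs.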
Table 4. Deflating spoke-expanded complexes
| Species | Publications | Pairs: initial | Pairs: remaining | Complexes: C | Complexes: G | Complexes: R | Complexes: A | Complexes: N |
|---|---|---|---|---|---|---|---|---|
| Yeast | 3924 | 118 819 | 28 643 | 7729 | 323 | 5384 | 3190 | 1311 |
| Human | 10 317 | 56 111 | 35 650 | 8382 | 181 | 1143 | 1443 | 304 |
| Fruitfly | 398 | 1722 | 1053 | 220 | 16 | 82 | 33 | 3 |
Shown are the numbers of complexes obtained by deflating binary interactions with affinity chromatography (or related) as experimental method. Types of complexes are indicated by one-letter codes described in Table 1. The counts of pairs shown include those from publications with fewer than three interactions (per database), which could never be deflated into complexes.
Table 5. Final consolidated data sets
| Species | Publications | Input pairs: biochem | Input pairs: other | Consolidated: complexes | Consolidated: directed | Consolidated: undirected | Conflicts: resolvable | Conflicts: unresolvable |
|---|---|---|---|---|---|---|---|---|
| Yeast | 6303 | 5780 | 119 329 | 10 778 | 5525 | 63 648 | 19 344 | 454 |
| Human | 22 660 | 2446 | 199 094 | 6483 | 2042 | 85 480 | 26 478 | 1333 |
| Fruitfly | 564 | 51 | 111 862 | 227 | 33 | 27 981 | 19 430 | 11 |
For each species, the table shows the numbers of input pairs (input complexes are those from Table 4), classified as either biochemical reactions (potentially directed) or others, and the final numbers of consolidated interactions (classified as complexes, directed or undirected). The 'Other' column accounts only for those interactions that were not deflated into complexes in Phase II. The last two columns show the total numbers of resolvable and unresolvable conflicts between consolidated interactions. An unresolvable conflict is an instance where two consolidated interactions, originating from the same publication, are reported by different databases using incompatible experimental detection method labels. A resolvable conflict is a case where source interactions within a single consolidated interaction have different (but compatible) experimental detection method labels.
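The Figure 2 caption adds a source-database rule: two incompatible subclusters are reported as a conflict only when they share no source database. A minimal sketch of that rule follows. It assumes the subclusters passed in are already known to carry incompatible labels. The label names and `report_conflicts` are hypothetical, and the one-letter codes follow the convention used in the conflict table (B for BioGRID, D for DIP, I for IntAct, M for MINT).

```python
from itertools import combinations


def report_conflicts(subclusters):
    """Return pairs of incompatible subclusters that share no source database.

    subclusters maps a (hypothetical) label to the set of one-letter source
    database codes that reported interactions under that label.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(subclusters), 2)
        if subclusters[a].isdisjoint(subclusters[b])
    ]


# Invented example: label_2 and label_3 share IntAct (I), so only the pairs
# involving label_1 are reported.
example = {
    "label_1": {"M"},
    "label_2": {"B", "I"},
    "label_3": {"D", "I"},
}
conflicts = report_conflicts(example)
print(conflicts)
```

The shared-source test reflects the idea that when one database itself files interactions under both labels, the difference is less likely to stem from diverging annotation policies.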
Table 6. Most common interaction detection method PSI-MI term conflicts
| Species | Term A | Sources A | Term B | Sources B | Counts |
|---|---|---|---|---|---|
| Yeast | MI:0007 (anti-tag coimmunoprecipitation) | M | MI:0676 (tandem affinity purification) | DI | 132 |
| Yeast | MI:0004 (affinity chromatography) | B | MI:0363 (inferred by author) | I | 60 |
| Yeast | MI:0018 (two hybrid) | DIMN | MI:0096 (pull down) | BI | 43 |
| Yeast | MI:0071 (molecular sieving) | DIN | MI:0096 (pull down) | B | 32 |
| Yeast | MI:0030 (cross linking study) | DIMN | MI:0096 (pull down) | B | 22 |
| Human | MI:0007 (anti-tag coimmunoprecipitation) | IM | MI:0676 (tandem affinity purification) | DI | 1227 |
| Human | MI:0018 (two hybrid) | BDHIM | MI:0096 (pull down) | BM | 17 |
| Human | MI:0096 (pull down) | B | MI:0107 (surface plasmon resonance) | DM | 6 |
| Human | MI:0008 (array technology) | I | MI:0049 (filter binding) | M | 5 |
| Human | MI:0019 (coimmunoprecipitation) | IM | MI:0096 (pull down) | BI | 5 |
The five most common unresolvable conflicts between interaction detection method PSI-MI terms are shown for the yeast (top) and human (bottom) data sets. Source databases are indicated by one-letter codes B (BioGRID), D (DIP), I (IntAct), H (HPRD), M (MINT) and P (MPPI).