Rezarta Islamaj Dogan1, Sun Kim1, Andrew Chatr-Aryamontri2, Christie S Chang3, Rose Oughtred3, Jennifer Rust3, W John Wilbur1, Donald C Comeau1, Kara Dolinski3, Mike Tyers2,4.
Abstract
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein-protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report.
Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
Year: 2017 PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Article distribution in BioC-BioGRID corpus (5)
| Organism | PMC articles | Interaction type |
|---|---|---|
| | 60 | PPI and GI |
| | 38 | PPI and GI |
| | 17 | PPI |
| | 5 | GI |
Figure 1. Annotation interface for the BioC-BioGRID corpus (5). Overlapping annotation types are shown in yellow in the interface. Here, gene names appear yellow because they are annotated both as ‘Gene’ and as part of a mention sentence.
Example text passages illustrating what annotators prefer to highlight for curation purposes. The first column shows the PMIDs. The second column lists the protein-protein and genetic interactions curated in BioGRID for the corresponding articles. The third column shows example annotations in the BioC-BioGRID corpus that the curators marked for the interactions in those articles.
| PMID | BioGRID interactions | Example annotations in the BioC-BioGRID corpus |
|---|---|---|
| | VPS29–VPS35, VPS5–VPS17, VPS5–VPS35, VPS5–VPS29 | This complex, designated here as the retromer complex, assembles from two distinct subcomplexes comprising (a) Vps35p, Vps29p, and Vps26p; and (b) Vps5p and Vps17p. In addition we have found that Vps35p assembles into a high molecular weight complex in the cytosol, and this assembly is dependent upon Vps29p. Therefore, to test directly the possibility that Vps5p and Vps17p are interacting with Vps35p/Vps29p, P100 membranes were cross-linked as before, and Vps35p was immunoprecipitated from the resulting lysates. |
| | VPS29–VPS35, VPS5–VPS17, VPS5–VPS35, VPS5–VPS29 | In lane 3, antibodies against Vps29p immunoprecipitated both Vps29p and Vps35p, along with the three other proteins that coimmunoprecipitate with Vps35p (compare lanes 1 and 3). GST-Vps5p isolated from either wild-type (data not shown) or vps5Delta (Fig. 2 C) yeast lysates using glutathione-sepharose was found to be bound to Vps17p, Vps35p, and Vps29p (Fig. 2 C, lane 2), but none of these proteins was detected when a control lysate from a strain expressing just GST was treated with glutathione-sepharose (Fig. 2 C, lane 1). |
| | FUS–UPF1 | We have also identified several human genes that, when over-expressed in yeast, are able to rescue the cell from the toxicity of mislocalized FUS/TLS. Over-expression of hUPF1 rescues the toxicity of both 1XFUS and 2XFUS (Figure 7A and B). |
| | FUS–ECM32, FUS–SBP1, FUS–SKO1, FUS–VHR1 | All the FUS/TLS-specific suppressors are DNA/RNA binding proteins (Table 1, top section; and Figure 6A), including ECM32, SBP1, SKO1, and VHR1. |
Figure 2. Summary of annotations in the BioC-BioGRID corpus. The table in the top panel lists all types of annotation infons as key:value pairs, along with a short description of what each annotation describes. The bottom panel consists of three text boxes. Text box number 1 contains an example of text from a passage in a document from the corpus. Text box number 2 shows an annotation in that passage for the gene name and its GeneID. Text box number 3 contains an annotation for a GI evidence passage.
Figure 3. Annotation process for the BioC-BioGRID corpus. In Phases I and II, the articles selected for curation were distributed equally among the four curators so that no curator received an article they had already seen. Articles contained no annotations, and curators were asked to curate them and mark the useful interaction information using the annotation interface. During Phase III, articles were again distributed equally, and curators were assigned articles they had not seen previously. Phase III articles contained pre-highlighted passages: text-mining predictions and passages annotated by only one of the Phase I or Phase II annotators. In this phase, curators were asked to review the annotations and remove those that were not useful for curation.
Figure 4. Graphic representation of IAA. For each article, one annotator highlighted several text passages as useful for curation during Phase I. A second annotator reading the same article marked a different set of passages (Phase II). The two sets overlap but also differ. Phase I and Phase II annotations that did not overlap were re-assessed in Phase III by two different curators, who decided whether each passage was useful. The striped area shows the set that was accepted during Phase III.
BioC-BioGRID corpus description of annotation types and their distribution
| Annotation type | Range per article | Average per article | Number of articles | Total |
|---|---|---|---|---|
| PPI mention | 0–69 | 16.4 | 114 | 1867 |
| PPI evidence | 0–36 | 6.4 | 109 | 701 |
| GI mention | 0–38 | 8.8 | 97 | 856 |
| GI evidence | 0–22 | 5.3 | 76 | 399 |
Annotation type averages are computed over the set of articles that contained that annotation type.
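As a quick consistency check on the table above, the per-article averages can be recomputed from the totals and article counts. The sketch below is only illustrative: the dictionary and function names are hypothetical, and the values are copied from the table.

```python
# (total annotations, number of articles containing that type), from the table above.
corpus_stats = {
    "PPI mention": (1867, 114),
    "PPI evidence": (701, 109),
    "GI mention": (856, 97),
    "GI evidence": (399, 76),
}

# Averages as reported in the table, rounded to one decimal place.
reported = {
    "PPI mention": 16.4,
    "PPI evidence": 6.4,
    "GI mention": 8.8,
    "GI evidence": 5.3,
}


def average_per_article(total, n_articles):
    """Mean number of annotations over the articles that contain this type."""
    return total / n_articles


averages = {
    name: average_per_article(total, n_articles)
    for name, (total, n_articles) in corpus_stats.items()
}

# Allow for one-decimal rounding in the reported values.
all_consistent = all(
    abs(averages[name] - reported[name]) < 0.06 for name in corpus_stats
)

for name, value in averages.items():
    print(f"{name}: {value:.2f} per article (reported {reported[name]})")
```

Dividing by the number of articles that contain the type, rather than by all 120 articles, is what makes the averages higher than a corpus-wide mean would be.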
BioC-BioGRID corpus annotations showing the number and coverage of annotations per organism type
| Annotation type | Number of annotations and articles (yeast) | Number of annotations and articles (human) |
|---|---|---|
| PPI mention | 843 (58) | 1024 (56) |
| PPI evidence | 343 (55) | 358 (54) |
| GI mention | 551 (57) | 305 (40) |
| GI evidence | 250 (49) | 149 (27) |
Inter-annotator agreement (IAA) values measuring the overlap of annotations between Phases I and II, and how this overlap increased once Phase III was included (by re-checking a subset of previously non-overlapping annotations)
| Annotation type | IAA (Phase II) | IAA (Phase III) |
|---|---|---|
| PPI mention | 0.38 | 0.70 |
| PPI evidence | 0.32 | 0.56 |
| GI mention | 0.42 | 0.73 |
| GI evidence | 0.40 | 0.60 |
IAA for PPI and GI passages computed as in Table 5
| Annotation type | IAA (Phase II) | IAA (Phase III) |
|---|---|---|
| PPI passage | 0.54 | 0.88 |
| GI passage | 0.62 | 0.95 |
The mention and evidence annotations are combined for counting purposes.
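The exact IAA formula is not given in this record, so the following is only a sketch of how an overlap-based agreement score could behave across phases. It assumes a Jaccard-style ratio (agreed passages divided by all passages marked by either annotator) and treats passages accepted in Phase III as agreed; the function name and the sentence identifiers are hypothetical.

```python
def overlap_agreement(phase1, phase2, phase3_accepted=frozenset()):
    """Jaccard-style agreement between two annotators' passage sets.

    ASSUMPTION: this is an illustrative measure, not necessarily the formula
    used for the corpus. Passages marked by only one annotator count as
    agreed if they were accepted during the Phase III review.
    """
    marked_by_either = phase1 | phase2
    if not marked_by_either:
        return 0.0  # nothing annotated; agreement is undefined, 0.0 by convention here
    agreed = (phase1 & phase2) | (phase3_accepted & marked_by_either)
    return len(agreed) / len(marked_by_either)


# Toy example: sentence identifiers marked in one article.
phase1 = {"s1", "s2", "s3", "s4"}
phase2 = {"s3", "s4", "s5", "s6"}
phase3_accepted = {"s1", "s5"}  # single-annotator passages judged useful in Phase III

before_review = overlap_agreement(phase1, phase2)
after_review = overlap_agreement(phase1, phase2, phase3_accepted)
print(f"before Phase III: {before_review:.2f}, after Phase III: {after_review:.2f}")
```

Under this assumption the score can only stay the same or rise after Phase III, which matches the direction of the change reported in the tables above.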
Curators’ selected text-mining annotations show that when presented with a random selection of text-mining predictions (that did not overlap with any curators’ annotations), the curators still find useful information
| Annotation type | Curators’ selected text mining annotations | Text mining recall of human annotations |
|---|---|---|
| PPI mention | 0.26 | 0.77 |
| PPI evidence | 0.19 | 0.70 |
| GI mention/GI evidence | 0.31 | 0.70 |
The last column shows the recall of text mining over all human annotations.
Figure 5. Analysis of section titles of full text articles showing where different annotation types are highlighted by curators. The Y-axis shows the proportion of annotations for each annotation type.
Illustration of different sentences that are not useful for curation grouped by reasons why curators did not find them useful.
| Reason | Example sentences |
|---|---|
| | To confirm the genetic predictions and to understand the nature of these interactions, we examined vrp1-1 phenotypes, in addition to temperature sensitive growth, for suppression by ACT1. To distinguish between these possibilities, we predicted that if Tup1 and Hda1 work together, then deletion of TUP1 in apc5CA cells should have the same synergistic effects as an HDA1 deletion. If there was synthetic lethality between the RSC degron mutant and Deltadia2, the number of viable colonies will be reduced in galactose but not in dextrose. |
| | Suppression by SEC24 appeared to be specific, since parallel tests of 2mu plasmids carrying SEC12, SEC13, SEC31, or SEC23 failed to show suppression. A base variant, which we refer to as base*, was detected by this method (Fig. 2a, top). |
| | Similarly, these substitutions were lethal in Deltanhp6a/b cells (70). The yeast PIN domain protein Swt1/Yor166c (Synthetic lethal with TREX 1) was identified in a screen for synthetic lethality with the TREX subunit Hpr1, interacts functionally with the TREX complex and is required for optimal transcription rates [18]. |
| | All together, our results demonstrate that the association of MUS81 with APBs is preferentially enriched at G2 phase. |
| | HA-YAP was precipitated from HepG2 cells expressing HA-YAP, and YAP ubiquitination was detected by an ubiquitin western blot. (A) Histone H3 associated with Rad53 is extensively modified. |
Additional infons as key:value pairs that complement the BioC-BioGRID corpus
| BioC-BioGRID infons as key:value pairs | Description of corpus annotations |
|---|---|
| Phase_I_Annotated:1 | Produced during Phase I |
| Phase_II_Annotated:1 | Produced during Phase II |
| Phase_III_Confirmed:0 | Reviewed during Phase III and no curator found it useful |
| Phase_III_Confirmed:1 | Reviewed during Phase III and one curator found it useful |
| Phase_III_Confirmed:2 | Reviewed during Phase III and two curators found it useful |
| Text_Mining_Shown:1 | Text-mining prediction shown to curators during Phase III |
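The infons in the table above can be used to filter annotations by how they fared in Phase III. The sketch below parses a small hand-written fragment in BioC XML style with Python's standard library. The infon keys come from the table; the sample passage, annotation ids, texts, the `type` value and the function names are made up for illustration and are not taken from the corpus files.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in BioC XML style; NOT copied from the corpus.
SAMPLE_XML = """
<passage>
  <offset>0</offset>
  <annotation id="a1">
    <infon key="type">example_type</infon>
    <infon key="Phase_I_Annotated">1</infon>
    <infon key="Phase_III_Confirmed">2</infon>
    <location offset="0" length="11"/>
    <text>passage one</text>
  </annotation>
  <annotation id="a2">
    <infon key="type">example_type</infon>
    <infon key="Phase_II_Annotated">1</infon>
    <infon key="Phase_III_Confirmed">0</infon>
    <location offset="12" length="11"/>
    <text>passage two</text>
  </annotation>
  <annotation id="a3">
    <infon key="type">example_type</infon>
    <infon key="Text_Mining_Shown">1</infon>
    <infon key="Phase_III_Confirmed">1</infon>
    <location offset="24" length="13"/>
    <text>passage three</text>
  </annotation>
</passage>
"""


def read_infons(annotation):
    """Collect an annotation's <infon> children into a key -> value dict."""
    return {infon.get("key"): (infon.text or "") for infon in annotation.findall("infon")}


def phase3_confirmed(root, min_curators=1):
    """Return ids of annotations found useful by at least `min_curators` in Phase III.

    Annotations without a Phase_III_Confirmed infon are treated here as having
    zero confirmations (an assumption of this sketch).
    """
    selected = []
    for annotation in root.iter("annotation"):
        infons = read_infons(annotation)
        votes = int(infons.get("Phase_III_Confirmed", "0"))
        if votes >= min_curators:
            selected.append(annotation.get("id"))
    return selected


root = ET.fromstring(SAMPLE_XML.strip())
print(phase3_confirmed(root))                  # confirmed by at least one curator
print(phase3_confirmed(root, min_curators=2))  # confirmed by both curators
```

Raising `min_curators` to 2 keeps only passages that both Phase III reviewers found useful, which may be a reasonable choice when a stricter training set is wanted.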