Emily K Mallory, Ce Zhang, Christopher Ré, Russ B Altman.
Abstract
MOTIVATION: A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles.
Year: 2015 PMID: 26338771 PMCID: PMC4681986 DOI: 10.1093/bioinformatics/btv476
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Gene–gene extraction pipeline. (A) We performed text pre-processing to parse documents into sentences and tokens and to construct dependency graphs between tokens in the sentences. These parsed data were stored in a sentences database. (B) The gene–gene extractor constructed candidate relations from the sentences and deposited them into a database. Each relation was composed of a pair of genes and features from the sentence. (C) DeepDive calculated the probability that each candidate relation was an interaction using inference rules based on the features. (D) We performed system tuning to identify and correct system errors. Furthermore, we applied a snowball technique in which correct relations were input as new training examples in the next system iteration.
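Step (B) of the pipeline can be illustrated with a minimal sketch, assuming a candidate relation is any pair of distinct gene mentions co-occurring in one sentence, annotated with the interaction verbs found between them. The gene lexicon, the verb-to-lemma map, the whitespace tokenizer and the function name `candidate_relations` are all hypothetical stand-ins; the actual extractor works on parsed sentences and dependency graphs and uses a much richer feature set.

```python
from itertools import combinations

# Hypothetical gene lexicon and verb lemmas, for illustration only.
GENE_SYMBOLS = {"BRCA1", "BARD1", "TP53", "MDM2"}
VERB_LEMMAS = {"binds": "bind", "interacts": "interact", "regulates": "regulate"}


def candidate_relations(sentence):
    """Return one candidate per pair of distinct gene mentions in the sentence."""
    # Naive tokenization; the real pipeline uses a proper NLP pre-processing step.
    tokens = sentence.replace(".", " ").replace(",", " ").split()
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in GENE_SYMBOLS]

    candidates = []
    for (i, gene_a), (j, gene_b) in combinations(mentions, 2):
        if gene_a == gene_b:
            continue  # skip repeated mentions of the same gene
        # Collect interaction verbs that appear between the two mentions.
        between = tokens[i + 1 : j]
        features = [
            f"Verb_Between_Genes_[{VERB_LEMMAS[tok]}]"
            for tok in between
            if tok in VERB_LEMMAS
        ]
        candidates.append({"gene_a": gene_a, "gene_b": gene_b, "features": features})
    return candidates
```

Each returned dictionary corresponds to one row that would be deposited in the candidate database, ready to be scored in step (C).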
Fig. 2. High-level feature patterns. Boxes represent relevant patterns in the sentence for the feature. Light gray boxes indicate features for GeneA and black boxes indicate features for GeneB. Shared boxes are represented with a medium gray box. The feature applies to both genes if only a light gray box is present.
Fig. 3. Histogram of probabilities assigned to gene–gene candidate relations.
Top 10 positive gene–gene features from DeepDive
| Feature | Weight |
|---|---|
| Single_Verb_Between_Genes_[bind] | 1.25 |
| Single_Verb_Between_Genes_[interact] | 1.07 |
| Verb_On_Dependency_Path_[bind] | 0.91 |
| Verb_On_Dependency_Path_[interact] | 0.74 |
| Single_Verb_Between_Genes_[regulate] | 0.67 |
| Verb_Between_Genes_[bind] | 0.63 |
| Verb_On_Dependency_Path_[regulate] | 0.58 |
| Window_Left_Gene1_Phrase_[GENE and] | 0.57 |
| Window_Right_Gene2_1gram_[protein] | 0.57 |
| Window_Left_Gene1_Phrase_[interaction between] | 0.51 |
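To give a feel for how weights like those above could translate into a probability, here is a minimal sketch. It assumes a candidate's score is the sum of the weights of its active features, passed through a logistic function with a zero bias. DeepDive itself performs inference over a factor graph, so this is a simplification rather than the system's implementation; `candidate_probability` and the zero bias are assumptions made for illustration.

```python
import math

# A few weights copied from the table above.
WEIGHTS = {
    "Single_Verb_Between_Genes_[bind]": 1.25,
    "Verb_On_Dependency_Path_[bind]": 0.91,
    "Window_Right_Gene2_1gram_[protein]": 0.57,
}


def candidate_probability(active_features, weights=WEIGHTS, bias=0.0):
    """Logistic of the summed weights of the active features (assumed model)."""
    # Features without a known weight contribute nothing to the score.
    score = bias + sum(weights.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-score))
```

Under this assumption, every additional positively weighted feature pushes the candidate's probability further above 0.5, which is why sentences with a single verb such as "bind" between two genes score highly.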
Evaluation against the DIP curated gold standard
| | Precision | Recall | F1 |
|---|---|---|---|
| DIP | 0.48 | 0.11 | 0.17 |
| DIP_Rescue | 0.68 | 0.14 | 0.23 |
| DIP_Sentence | 0.68 | 0.46 | 0.54 |
| DIP_Indirect | 0.76 | 0.49 | 0.59 |
DIP contained all unique interactions. DIP_Rescue additionally counted as true positives the correct interactions not curated by DIP. DIP_Sentence built on DIP_Rescue but restricted the DIP positives to pairs of standard gene symbols co-occurring in a sentence. DIP_Indirect built on DIP_Sentence and also counted indirect interactions as true positives.
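The metrics in this table can be computed as in the sketch below, which assumes that interactions are unordered gene pairs, that predictions are compared against a gold-standard set, and that the last column is the F1 score (the harmonic mean of precision and recall). The function name `precision_recall_f1` and the gene pairs in the usage example are hypothetical and are not data from the study.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted gene pairs against gold-standard pairs, ignoring order."""
    # frozenset makes ("A", "B") and ("B", "A") the same interaction.
    predicted = {frozenset(pair) for pair in predicted}
    gold = {frozenset(pair) for pair in gold}

    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `precision_recall_f1([("A", "B"), ("C", "D")], [("B", "A"), ("E", "F")])` counts one true positive out of two predictions and two gold pairs. Relaxing what counts as a positive, as the DIP_Rescue, DIP_Sentence and DIP_Indirect variants do, changes the two input sets and therefore the resulting scores.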
Sentence- and document-level precision for Curation_Positive set of gene–gene candidate relations
| | Sentence-level precision | Document-level precision |
|---|---|---|
| Curation_Positive_Stringent | 0.62 | 0.71 |
| Curation_Positive_All | 0.79 | 0.83 |
Fig. 4. Number of publications per year for genes appearing in high probability interactions. Top genes are ordered by publication count in 2013.