Guocai Chen, Jieyi Zhao, Trevor Cohen, Cui Tao, Jingchun Sun, Hua Xu, Elmer V Bernstam, Andrew Lawson, Jia Zeng, Amber M Johnson, Vijaykumar Holla, Ann M Bailey, Humberto Lara-Guerra, Beate Litzenburger, Funda Meric-Bernstam, W Jim Zheng.
Abstract
Ambiguous gene names in the biomedical literature are a barrier to accurate information extraction. To overcome this hurdle, we generated Ontology Fingerprints for selected genes that are relevant for personalized cancer therapy. These Ontology Fingerprints were used to evaluate the association between genes and biomedical literature and thereby disambiguate gene names. We obtained 93.6% precision for the test gene set and an area under the receiver operating characteristic (ROC) curve of 80.4% for gene and article association. The core algorithm was implemented using a graphics processing unit (GPU)-based MapReduce framework to handle big data and to improve performance. We conclude that Ontology Fingerprints can help disambiguate gene names mentioned in text and analyse the association between genes and articles. Database URL: http://www.ontologyfingerprint.org
Year: 2015 PMID: 25858285 PMCID: PMC4390608 DOI: 10.1093/database/bav034
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Diagram of the process for assessing articles selected for a specific candidate gene name. In this example, ABGene or GNAT identified the candidate gene name pkb in the abstract with PMID 9368760. The name pkb matches the name or an alias of both the cancer-related gene AKT1 and another gene, PTK2B. We used the Ontology Fingerprints of both AKT1 and PTK2B to calculate a similarity score for the abstract. Because AKT1 scored higher than PTK2B, the abstract was assigned to AKT1 rather than PTK2B.
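The assignment step described in Figure 1 can be sketched as below. This is a minimal illustration, not the published scoring method: the `similarity_score` function (a sum of negative log p-values over fingerprint terms found in the abstract), the GO term names, and all p-values are assumptions made up for the example.

```python
import math

def similarity_score(abstract_terms, fingerprint):
    """Hypothetical score: sum of -log10(p) over fingerprint terms present in the abstract."""
    return sum(-math.log10(p) for term, p in fingerprint.items() if term in abstract_terms)

def disambiguate(abstract_terms, candidate_fingerprints):
    """Score every candidate gene and return (best_gene, all_scores)."""
    scores = {
        gene: similarity_score(abstract_terms, fingerprint)
        for gene, fingerprint in candidate_fingerprints.items()
    }
    best_gene = max(scores, key=scores.get)
    return best_gene, scores

# Illustrative fingerprints only: GO term -> enrichment p-value (invented values).
fingerprints = {
    "AKT1": {
        "protein kinase activity": 1e-8,
        "apoptotic process": 1e-5,
        "insulin receptor signaling pathway": 1e-6,
    },
    "PTK2B": {
        "protein kinase activity": 1e-4,
        "focal adhesion": 1e-7,
    },
}

# GO terms assumed to have been found in the abstract that mentions "pkb".
abstract_terms = {
    "protein kinase activity",
    "apoptotic process",
    "insulin receptor signaling pathway",
}

best_gene, scores = disambiguate(abstract_terms, fingerprints)
print(best_gene, scores)
```

With these invented values the abstract is assigned to AKT1, mirroring the outcome in the figure; any other monotone scoring function could be substituted for `similarity_score`.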
Figure 2. An example of a PubMed abstract (PubMed ID: 9368760) that contains three GO terms for gene AKT1. The Ontology Fingerprint of the gene and the calculation of the gene’s rank are illustrated.
Figure 3. Annotation results for gene symbols in six groups; the symbols within each group share common synonyms. The blue bars indicate genes annotated correctly according to gene2pubmed, and the red hatched bars indicate all the annotations that do not match any gene2pubmed record.
Figure 4. Architecture of the GPU-based MapReduce framework for literature ranking.
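The data flow of a MapReduce-style literature ranking can be sketched on the CPU in plain Python. This is only an illustration of the map and reduce roles, not the GPU framework itself: the function names, the set-overlap score, and the toy abstracts and fingerprints are all hypothetical.

```python
from collections import defaultdict

def overlap_score(abstract_terms, fingerprint_terms):
    """Hypothetical score: number of fingerprint terms that occur in the abstract."""
    return len(abstract_terms & fingerprint_terms)

def map_phase(abstracts, fingerprints, score_fn):
    """Map: emit one (gene, (pmid, score)) pair per abstract and gene."""
    for pmid, terms in abstracts.items():
        for gene, fingerprint in fingerprints.items():
            yield gene, (pmid, score_fn(terms, fingerprint))

def reduce_phase(pairs):
    """Reduce: group pairs by gene and rank each gene's abstracts by descending score."""
    grouped = defaultdict(list)
    for gene, item in pairs:
        grouped[gene].append(item)
    return {
        gene: sorted(items, key=lambda item: item[1], reverse=True)
        for gene, items in grouped.items()
    }

# Toy inputs (invented): abstract ID -> terms, gene -> fingerprint terms.
abstracts = {"A": {"t1", "t2"}, "B": {"t3"}}
fingerprints = {"G1": {"t1", "t2"}, "G2": {"t3", "t4"}}

ranked = reduce_phase(map_phase(abstracts, fingerprints, overlap_score))
print(ranked)
```

Each map call is independent of the others, which is what makes this pattern a natural fit for massively parallel hardware.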
True and false positives for the six genes
| Gene ID | 5290 | 207 | 4233 | 4893 | 2260 | 5728 |
|---|---|---|---|---|---|---|
| True positives | 24 | 26 | 40 | 9 | 1 | 2 |
| False positives | 7 | 0 | 0 | 0 | 0 | 0 |
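The counts in this table can be checked against the precision reported in the abstract. The sketch below reads the first row of counts as true positives and the second as false positives; that reading is an assumption, although it reproduces the reported 93.6%.

```python
# Counts per gene, in the column order of the table.
# Assumption: first data row = true positives, second data row = false positives.
true_positives = [24, 26, 40, 9, 1, 2]
false_positives = [7, 0, 0, 0, 0, 0]

tp = sum(true_positives)
fp = sum(false_positives)

# Precision = TP / (TP + FP)
precision = tp / (tp + fp)
print(f"TP={tp}, FP={fp}, precision={precision:.1%}")
```

All seven false positives fall in the first column, so precision for the other five genes is 100% in this test set.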
Figure 5. ROC curve for the gene and article association for different levels of normalized ranks.
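As a reminder of what the area under an ROC curve measures, the sketch below computes it as the probability that a randomly chosen true gene and article pair scores higher than a randomly chosen false pair (the Mann-Whitney formulation). The labels and scores are toy values, and the convention that a higher score means a stronger association is an assumption; if ranks are used where lower is better, negate them first.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data (invented): 1 = article truly associated with the gene, 0 = not associated.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sort-based rank formula gives the same value for large datasets.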
Figure 6. Precision as a function of increasing threshold in the cross-validation on articles published after 2009.
Figure 7. A negative case: the gene name pi3k was annotated as PIK3CA by our method, whereas NCBI designated it as PIK3CG.
| | Abs (gene) | Abs (no gene) | Subtotal |
|---|---|---|---|
| Abs (GO term) | | | |
| Abs (no GO term) | | | |
| Subtotal | | | |
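A 2×2 table of this shape (abstracts with or without the gene against abstracts with or without the GO term) is commonly scored with a hypergeometric enrichment test. The sketch below shows that calculation; the choice of test is an assumption, since this section does not state which statistic fills the table, and the counts are toy values.

```python
from math import comb

def hypergeom_enrichment_p(k, K, n, N):
    """P(X >= k) under a hypergeometric model.

    N: total number of abstracts
    K: abstracts that contain the GO term
    n: abstracts linked to the gene
    k: abstracts that are linked to the gene AND contain the GO term
    """
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy counts (invented): 10 abstracts, 3 contain the GO term,
# 4 are linked to the gene, and 2 of those 4 contain the GO term.
p_value = hypergeom_enrichment_p(k=2, K=3, n=4, N=10)
print(f"p = {p_value:.4f}")
```

A small p-value indicates that the GO term appears in the gene's abstracts more often than expected by chance, which is the kind of evidence a term-to-gene fingerprint would record.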