Lawrence C Lee, Florence Horn, Fred E Cohen.
Abstract
Protein point mutations are an essential component of the evolutionary and experimental analysis of protein structure and function. While many manually curated databases attempt to index point mutations, most experimentally generated point mutations and the biological impacts of the changes are described in the peer-reviewed published literature. We describe an application, Mutation GraB (Graph Bigram), that identifies, extracts, and verifies point mutations from biomedical literature. The principal problem of point mutation extraction is to link the point mutation with its associated protein and organism of origin. Our algorithm uses a graph-based bigram traversal to identify these relevant associations and exploits the Swiss-Prot protein database to verify this information. The graph bigram method is different from other models for point mutation extraction in that it incorporates frequency and positional data of all terms in an article to drive the point mutation-protein association. Our method was tested on 589 articles describing point mutations from the G protein-coupled receptor (GPCR), tyrosine kinase, and ion channel protein families. We evaluated our graph bigram metric against a word-proximity metric for term association on datasets of full-text literature in these three different protein families. Our testing shows that the graph bigram metric achieves a higher F-measure for the GPCRs (0.79 versus 0.76), protein tyrosine kinases (0.72 versus 0.69), and ion channel transporters (0.76 versus 0.74). Importantly, in situations where more than one protein can be assigned to a point mutation and disambiguation is required, the graph bigram metric achieves a precision of 0.84 compared with the word distance metric precision of 0.73. We believe the graph bigram search metric to be a significant improvement over previous search metrics for point mutation extraction and to be applicable to text-mining applications requiring the association of words.
Year: 2007 PMID: 17274683 PMCID: PMC1794323 DOI: 10.1371/journal.pcbi.0030016
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. An Energy-Minimized Graph Generated from the Full-Text Article PMID 11553787
The blue ellipses represent protein term nodes, green ellipses represent point mutation nodes, and orange ellipses represent organism nodes. The gray triangles represent regular words. The connecting edges show terms or words represented by the nodes that are present as a bigram in the text. For this article, a total of 1,052 terms are contained in 2,287 bigrams.
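The graph structure described in this caption (one node per term, one edge per pair of terms that occur as a bigram) can be illustrated with a minimal Python sketch. This is not the Mutation GraB implementation: the function names `build_bigram_graph` and `count_edges`, the whitespace tokenization, the example sentence, and the choice to treat edges as undirected and to ignore self-loops are all assumptions made for illustration.

```python
def build_bigram_graph(tokens):
    """Build an undirected graph from a token sequence (illustrative only).

    Each distinct token becomes a node; each pair of adjacent tokens
    (a bigram) becomes an edge. Treating edges as undirected and
    skipping self-loops are assumptions, not details from the article.
    """
    graph = {token: set() for token in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def count_edges(graph):
    """Count distinct undirected edges (each edge is stored twice)."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


# Hypothetical sentence; "a100v" stands in for a point mutation term.
tokens = "the a100v mutant of the receptor binds the ligand".split()
graph = build_bigram_graph(tokens)
print(len(graph), "terms,", count_edges(graph), "bigram edges")
```

For this toy sentence the sketch reports 7 terms and 8 bigram edges. A repeated word such as "the" becomes a single node with several neighbors, which is what lets distant parts of a text become connected in the graph.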
Figure 2. A General Overview of the Process Flow of Mutation GraB
Protein Family Literature Sets
Protein Family and Dictionary Information
Mutation GraB Performance on the GPCR Literature Sets
Figure 3. Examining the Precision of the Graph Bigram and Word Distance Metrics across Different Levels of Possible Protein Associations for the GPCR (A), Protein Tyrosine Kinase (B), and Ion Channel Transporter (C) Literature Sets
The data are for the development and validation sets combined. The yellow bars show the number of point mutations counted at each level of possible protein associations (PPA). The solid blue line represents the precision measured for these point mutations using the graph bigram metric, and the dotted red line represents the precision measured using the word distance metric.
Mutation GraB Performance on the Protein Tyrosine Kinase Literature Sets
Mutation GraB Performance on the Ion Channel Transporter Literature Sets
Mutation GraB versus MEMA Performance
Xylanase Literature Set and Proteins and Point Mutations within
Mutation GraB versus Mutation Miner Performance
Figure 4. Example of a Paragraph of Text Evaluated by the Graph Bigram and Word Distance Metrics
(A) Text is taken from a figure label from the article PMID 10889210.
(B) Graph generated by bigram traversal using the graph bigram method. The point mutation terms are in green, protein terms in blue, and regular words in gray.
(C) Table shows the measurements between some selected words in the text using both the word distance and graph bigram metrics. The word distance measurements are below the diagonal, and the graph bigram measurements are above the diagonal. Two different word pairs are examined, {fig, bars} and {alteration, scatchard}.
The {fig, bars} words are shown in red in (A), the path is colored in red in (B), and the metric measurements are highlighted in red in (C). The {alteration, scatchard} items are highlighted in blue, correspondingly.
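To make the contrast between the two measurements concrete, the sketch below computes both for a pair of words in a toy sentence. It rests on stated assumptions: word distance is taken as the smallest gap in token positions between any occurrences of the two words, and the graph bigram measurement is approximated by the shortest path length in an unweighted, undirected bigram graph. The published metric also incorporates term frequency and positional data, which this sketch does not model. The function names and the example sentence are hypothetical.

```python
from collections import deque


def word_distance(tokens, a, b):
    """Smallest gap in token positions between occurrences of a and b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def graph_distance(tokens, a, b):
    """Shortest path length between a and b in the bigram graph.

    Assumption: an unweighted, undirected graph searched breadth-first.
    This is a stand-in for, not a copy of, the article's metric.
    """
    neighbors = {t: set() for t in tokens}
    for left, right in zip(tokens, tokens[1:]):
        neighbors[left].add(right)
        neighbors[right].add(left)
    if a not in neighbors or b not in neighbors:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical sentence; "a100v" stands in for a point mutation term.
tokens = ("a100v lowers binding while other changes in the "
          "receptor also affect binding").split()
print(word_distance(tokens, "a100v", "receptor"))
print(graph_distance(tokens, "a100v", "receptor"))
```

Here the word distance between "a100v" and "receptor" is 8, while the graph path is 5 because the repeated word "binding" provides a shortcut (a100v, lowers, binding, affect, also, receptor). Under these assumptions the graph distance can never exceed the word distance, since the stretch of text between the two words is itself a path in the graph.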
Mutation GraB Performance on All Protein Family Literature Sets with and without Image Mutations Using the Graph Bigram Metric