Shabana Vohra, Philip C Biggin.
Abstract
There has been a rapid increase in the amount of mutational data due to, amongst other things, an increase in single nucleotide polymorphism (SNP) data and the use of site-directed mutagenesis as a tool to help dissect the functional properties of proteins. Many manually curated databases have been developed to index point mutations, but they are not sustainable given the ever-increasing volume of scientific literature. There have been considerable efforts to extract mutation-specific information automatically from raw text using various text-mining approaches. However, one of the key problems is to link each mutation with its associated protein and to present the data in such a way that researchers can immediately contextualize it within a structurally related family of proteins. To aid this process, we have developed an application called MutationMapper. Point mutations are extracted from abstracts and are validated against protein sequences in Uniprot as far as possible. Our methodology differs in a fundamental way from the usual text-mining approach. Rather than start with abstracts, we start with protein sequences, which greatly facilitates the process of validating a potential point mutation identified in an abstract. The results are displayed as mutations mapped on to the protein sequence or a multiple sequence alignment. The latter enables one to readily pick up mutations performed at equivalent positions in related proteins. We demonstrate the use of MutationMapper on several examples, including a single sequence and multiple sequence alignments. The application is available as a web-service at http://mutationmapper.bioch.ox.ac.uk.
Year: 2013 PMID: 23951226 PMCID: PMC3739722 DOI: 10.1371/journal.pone.0071711
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic flow chart of the process used in MutationMapper.
The starting point is a single sequence or multiple sequence alignment with the Uniprot ID or Accession code as the identifier. The identifier is used to query Uniprot to retrieve protein names, gene names and synonyms which are then used to retrieve abstracts from PubMed. Abstracts are converted to raw text and then the program MutationFinder [8] is used to extract possible mutations. These mutations are then mapped back to the protein sequence(s) with three possible outcomes: i) mapped, ii) non-mapped and iii) multi-mapped. Only mapped and multi-mapped results are highlighted on the sequence (or multiple sequence alignment) presented back to the user.
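The mapping step described above can be illustrated with a minimal Python sketch. It is not the MutationMapper implementation. It assumes mutations arrive in the single-letter form wild type, position, mutant (for example "A123G"), that a mutation counts as validated when the wild-type residue is found at that 1-based position in a sequence, and that "multi-mapped" means the mutation validates against more than one of the supplied sequences (the text here does not define the term). The names `parse_mutation` and `classify_mutation` and the toy sequences are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Single-letter point mutation such as "A123G": wild type, position, mutant.
# (Assumed input format; the extraction tool's real output may differ.)
MUTATION_RE = re.compile(
    r"^([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])$"
)


def parse_mutation(text: str) -> Tuple[str, int, str]:
    """Split a mutation string into (wild_type, position, mutant)."""
    match = MUTATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a single-letter point mutation: {text!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def classify_mutation(
    mutation: str, sequences: Dict[str, str]
) -> Tuple[str, List[str]]:
    """Classify a mutation as mapped, multi-mapped or non-mapped.

    A sequence is a hit when the wild-type residue sits at the stated
    1-based position. Alignment gaps are not handled in this sketch.
    """
    wild_type, position, _mutant = parse_mutation(mutation)
    hits = [
        name
        for name, seq in sequences.items()
        if 1 <= position <= len(seq) and seq[position - 1] == wild_type
    ]
    if not hits:
        return "non-mapped", hits
    if len(hits) == 1:
        return "mapped", hits
    return "multi-mapped", hits


# Hypothetical toy sequences, for illustration only.
toy_sequences = {"SEQ_A": "MKTAY", "SEQ_B": "MKSAY"}
for candidate in ("T3A", "K2R", "W4A"):
    print(candidate, classify_mutation(candidate, toy_sequences))
```

Checking the wild-type residue against the sequence is what makes starting from sequences useful: a candidate whose wild-type residue does not match any supplied sequence is simply not highlighted.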
Figure 2. Example screenshots: (A) starting submission screen, (B) multiple sequence alignment for P2X proteins and (C) detailed information screen for a mutation found in the P2X7 protein.
Information retrieval and mutation extraction in three test cases.
| Sequence | Keywords | Abstracts retrieved | Abstracts with mutations | Number of mutations |
| PTEN_HUMAN | “PTEN”, “PHOSPHATASE AND TENSIN HOMOLOG”, “MMAC1”, “TEP1” | 6001 | 297 | 435 |
| GLPA_HUMAN | “GLYCOPHORIN-A”, “MN SIALOGLYCOPROTEIN”, “SIALOGLYCOPROTEIN ALPHA”, “GYPA”, “GPA” | 8837 | 95 | 183 |
| CHEY_ECOLI | “CHEMOTAXIS PROTEIN CHEY”, “CHEY” | 933 | 51 | 103 |
These are the keywords that MutationMapper automatically extracted from Uniprot and used to search PubMed.
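As a rough illustration of how such keywords could be turned into a single literature search, the sketch below joins them into one boolean query. How MutationMapper actually combines its keywords is not stated here, so joining quoted phrases with OR is an assumption, and `build_query` is a hypothetical helper name.

```python
from typing import Iterable


def build_query(keywords: Iterable[str]) -> str:
    """Join keywords into one OR query, quoting each as an exact phrase.

    Assumption: any keyword matching is enough to retrieve an abstract.
    """
    cleaned = [k.strip() for k in keywords if k.strip()]
    return " OR ".join(f'"{k}"' for k in cleaned)


# Keywords listed for PTEN_HUMAN in the table above.
print(build_query(["PTEN", "PHOSPHATASE AND TENSIN HOMOLOG", "MMAC1", "TEP1"]))
```

Short synonyms such as “GPA” match many unrelated abstracts, which is consistent with the large gap between abstracts retrieved and abstracts with mutations in the table.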
Information retrieval in NMDZ1_RAT with user input expressions.
| | MM Expressions | User Input Expressions |
| Keywords | "GLUTAMATE NMDA RECEPTOR SUBUNIT ZETA1", "NMETHYLDASPARTATE RECEPTOR SUBUNIT NR1", "NMDR1", "GRIN1", "NMDAR1" | "GLUTAMATE NMDA RECEPTOR SUBUNIT ZETA1", "NMETHYLDASPARTATE RECEPTOR SUBUNIT NR1", "NMDR1", "GRIN1", "NMDAR1", |
| Abstracts retrieved | 2179 | 3178 |
| Abstracts with mutations | 77 | 103 |
| Number of mutations | 143 | 218 |