Anne-Lise Veuthey, Alan Bridge, Julien Gobeill, Patrick Ruch, Johanna R McEntyre, Lydie Bougueleret, Ioannis Xenarios.
Abstract
BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information, we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB.
Year: 2013 PMID: 23517090 PMCID: PMC3660268 DOI: 10.1186/1471-2105-14-104
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A typical sentence with information on protein glycosylation. Boxes indicate the information that is extracted from the sentence.
PTM filtering tokens and information extraction assessment
| PTM | Filtering token | # filtered abstracts | # retrieved abstracts | Precision | | |
| Acetylation | “acet” | 26,144 | 1,753 | 65% | 97 | 89% |
| Amidation | “amid” | 21,861 | 1,515 | 73% | 61 | 95% |
| Disulfide bond | “disulf” | 6,933 | 1,095 | 94% | 514 | 75% |
| Glycosylation | “glyco” | 31,379 | 2,746 | 73% | 464 | 85% |
| Methylation | “methyl” | 28,015 | 664 | 57% | 47 | 87% |
| Phosphorylation | “phospho” | 61,144 | 16,129 | 71% | 906 | 93% |
| Sulfation | “sulf” | 20,834 | 256 | 65% | 40 | 92% |
“Filtering token” is the term used to select the abstracts, “# filtered abstracts” is the number of abstracts which contain these terms, and “# retrieved abstracts” is the number of abstracts selected by the complete sentence extraction procedure. Precision was estimated based on manual analysis of 100 positive abstracts.
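The abstract-level filtering step described in this legend can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokens are copied from the table, while the function name `candidate_ptms` and the choice of case-insensitive substring matching are assumptions made for the example.

```python
from typing import List

# Filtering tokens copied from the table above; everything else is illustrative.
FILTERING_TOKENS = {
    "Acetylation": "acet",
    "Amidation": "amid",
    "Disulfide bond": "disulf",
    "Glycosylation": "glyco",
    "Methylation": "methyl",
    "Phosphorylation": "phospho",
    "Sulfation": "sulf",
}


def candidate_ptms(abstract: str) -> List[str]:
    """Return the PTM types whose filtering token occurs in the abstract.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    text = abstract.lower()
    return [ptm for ptm, token in FILTERING_TOKENS.items() if token in text]
```

Substring tokens are deliberately broad. In this sketch "sulf" also matches "disulfide", so an abstract about a disulfide bond is flagged for both types. Such a filter can therefore only be a first pass, which fits the large gap between the filtered and retrieved counts in the table.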
Figure 2. An abstract containing information relevant to protein acetylation. The extracted sentences are highlighted in orange, PTM and site information in yellow, and gene/protein mentions in blue. The list of extracted sites and proteins with scores is also provided. The last two sentences, which mention acetylation, are not highlighted since they contain no site information.
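The behaviour described in this caption, where a sentence is kept only if it mentions the PTM and also carries site information, can be illustrated with a short sketch. The regular expression `SITE`, the naive sentence splitter, and the function name `sentences_with_sites` are all hypothetical. They are not the patterns or scoring used by the authors, and the example abstract in the usage note is made up.

```python
import re

# Hypothetical site pattern: a three-letter residue name followed by a position,
# e.g. "Lys-9", "Ser 15", "Thr308". Not the pattern set used by the authors.
SITE = re.compile(
    r"\b(Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
    r"Phe|Pro|Ser|Thr|Trp|Tyr|Val)[- ]?(\d+)\b"
)


def sentences_with_sites(abstract, token):
    """Return (sentence, sites) pairs for sentences that contain the
    filtering token and at least one site mention."""
    # Naive sentence split on whitespace after ., ! or ? (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    hits = []
    for sentence in sentences:
        if token not in sentence.lower():
            continue
        sites = [f"{res}-{pos}" for res, pos in SITE.findall(sentence)]
        if sites:  # sentences without site information are skipped
            hits.append((sentence, sites))
    return hits
```

For the made-up abstract "p53 is acetylated at Lys-382 by p300. Acetylation increases DNA binding." and the token "acet", only the first sentence is returned. The second mentions acetylation but has no site, so it is dropped in the same way as the unhighlighted sentences in the figure.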
Assessment on the GENIA and RLIMS-P corpora
| | Acetylation (GENIA) | Glycosylation (GENIA) | Methylation (GENIA) | Phosphorylation (RLIMS-P) |
| # events/documents | 68 | 93 | 70 | 110 |
| False negative | 7 | 31 | 26 | 15 |
| False positive | 2 | 0 | 0 | 12 |
| Recall | 90% | 66% | 63% | 86% |
| Precision | 97% | 100% | 100% | 89% |
#events is the number of acetylation, glycosylation or methylation events with site information in the GENIA corpus; #documents is the number of abstracts positive for phosphorylation information in the RLIMS-P corpus.
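The recall and precision rows can be re-derived from the counts in the table. The sketch below assumes that the number of true positives equals the number of events or documents minus the false negatives. The source does not state this explicitly, but it reproduces the reported percentages. Columns are referred to by position.

```python
# Counts copied from the assessment table, in column order.
COLUMNS = [
    {"total": 68, "false_negative": 7, "false_positive": 2},
    {"total": 93, "false_negative": 31, "false_positive": 0},
    {"total": 70, "false_negative": 26, "false_positive": 0},
    {"total": 110, "false_negative": 15, "false_positive": 12},
]


def recall(total, false_negative):
    """Fraction of annotated events/documents that were found."""
    return (total - false_negative) / total


def precision(total, false_negative, false_positive):
    """Fraction of reported hits that were correct.

    Assumes true positives = total - false negatives.
    """
    true_positive = total - false_negative
    return true_positive / (true_positive + false_positive)


for i, col in enumerate(COLUMNS, start=1):
    r = recall(col["total"], col["false_negative"])
    p = precision(**col)
    print(f"column {i}: recall {r:.1%}, precision {p:.1%}")
```

Under this assumption the computed values agree with the table after rounding. The one exception is the recall in the second column, which comes out at 66.7% where the table reports 66%.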
Results of screening PubMed abstracts for PTM information
| Retrieved abstracts | 75,777 |
| With PTM information | 1,266 (863) |
| Acetylation | 119 (56) |
| Amidation | 96 (6) |
| Disulfide bridge | 173 (42) |
| Glycosylation | 108 (27) |
| Methylation | 26 (6) |
| Phosphorylation | 730 (730) |
| Sulfation | 14 (2) |
“Retrieved abstracts” are the result of querying PubMed with the keyword “protein”. The number of PTMs with positional information is shown in parentheses for each PTM type (site information was a prerequisite for retrieval of phosphorylation information).
Figure 3. Phosphosite information retrieval: pipeline for the retrieval of documents that potentially provide supporting evidence for existing phosphosite annotations in UniProtKB/Swiss-Prot, where such annotations were made on the basis of information from high-throughput mass spectrometry-based proteomics experiments.