Darren A Natale, Cecilia N Arighi, Judith A Blake, Jonathan Bona, Chuming Chen, Sheng-Chih Chen, Karen R Christie, Julie Cowart, Peter D'Eustachio, Alexander D Diehl, Harold J Drabkin, William D Duncan, Hongzhan Huang, Jia Ren, Karen Ross, Alan Ruttenberg, Veronica Shamovsky, Barry Smith, Qinghua Wang, Jian Zhang, Abdelrahman El-Sayed, Cathy H Wu.
Abstract
The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert-curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely related terms, including, for example, an interactive multiple sequence alignment. Finally, we describe recent improvements in semantic utility, with PRO now represented in OWL and available through a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate the discovery and aggregation of data relating to protein entities.
Year: 2016 PMID: 27899649 PMCID: PMC5210558 DOI: 10.1093/nar/gkw1075
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Abbreviations used
| Abbreviation | Full name |
|---|---|
| GO | Gene Ontology |
| OBO | Open Biomedical Ontologies |
| OWL | Web Ontology Language |
| PSI-MOD | Proteomics Standards Initiative Protein Modification Ontology |
| PTM; iPTMnet | Post-translational modification; integrated PTM network |
| SPARQL | SPARQL Protocol and RDF Query Language |
| UniProt; UniProtKB | Universal Protein Resource; UniProt KnowledgeBase |
| URI; URL | Uniform Resource Identifier; Uniform Resource Locator |
| W3C | World Wide Web Consortium |
Figure 1. Standard representation of a proteoform. Proteoforms are represented using a standard format as annotated, consisting of a sequence block and one or more optional modification blocks. A sequence block consists of a UniProtKB accession with an optional isoform indicator separated by a dash, followed by a comma (first arrow) and an optional subsequence range. The first modification block, if specified, follows a comma, and all subsequent modification blocks follow a pipe (second arrow). Modification blocks are presented in order of the N-terminal-most amino acid modified. Each modification block lists one or more amino acids by type and position, with multiples separated by slashes, followed by the PSI-MOD identifier specifying the type of modification. When an isoform is specified, the N-terminal and C-terminal positions of subsequences, as well as the positions of modifications, are relative to the full length of that isoform; otherwise the numbering of the representative sequence is assumed. Only the accession is required. A missing subsequence indicates that the class encompasses either multiple species or multiple isoforms. A subsequence without modification blocks indicates that the class is defined by subsequence only (such as when the only distinction is that a signal peptide has been removed).
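The format described in the Figure 1 caption can be read mechanically. The following is a minimal Python sketch of such a reader, not PRO's own implementation. The figure itself is not reproduced here, so several details are assumptions: the `UniProtKB:` prefix, residues written as `Ser-15`, a comma between the residue list and the PSI-MOD identifier, and the example string, which is made up for illustration and is not claimed to be an actual PRO term. The names `Proteoform`, `ModBlock`, `parse_mod_block` and `parse_proteoform` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ModBlock:
    residues: List[Tuple[str, int]]  # (amino acid type, position)
    mod_id: str                      # PSI-MOD identifier, e.g. "MOD:00046"


@dataclass
class Proteoform:
    accession: str
    isoform: Optional[str] = None
    subsequence: Optional[Tuple[int, int]] = None
    modifications: List[ModBlock] = field(default_factory=list)


# Assumed shape of a subsequence range: "start-end", both integers.
RANGE_RE = re.compile(r"^(\d+)-(\d+)$")


def parse_mod_block(text: str) -> ModBlock:
    """Parse one block such as 'Ser-15/Ser-20, MOD:00046' (assumed syntax)."""
    residues_part, mod_id = [p.strip() for p in text.split(",")]
    residues = []
    for token in residues_part.split("/"):
        aa, pos = token.strip().split("-")
        residues.append((aa, int(pos)))
    return ModBlock(residues, mod_id)


def parse_proteoform(text: str) -> Proteoform:
    # Pipes separate the second and later modification blocks.
    first, *other_blocks = text.strip().rstrip(".").split("|")
    parts = [p.strip() for p in first.split(",")]

    # Sequence block: accession with an optional "-isoform" suffix.
    acc = parts.pop(0).split(":", 1)[-1]
    accession, _, isoform = acc.partition("-")
    result = Proteoform(accession, isoform or None)

    # Optional subsequence range directly after the accession.
    if parts:
        match = RANGE_RE.match(parts[0])
        if match:
            result.subsequence = (int(match.group(1)), int(match.group(2)))
            parts.pop(0)

    # Whatever remains before the first pipe is the first modification block.
    if parts:
        result.modifications.append(parse_mod_block(", ".join(parts)))
    for block in other_blocks:
        result.modifications.append(parse_mod_block(block))
    return result


if __name__ == "__main__":
    # Illustrative string only; not taken from PRO.
    print(parse_proteoform(
        "UniProtKB:P04637-1, 1-393, Ser-15/Ser-20, MOD:00046|Lys-120, MOD:00064"
    ))
```

Because only the accession is required, the sketch treats the subsequence range and every modification block as optional. It does not check that modification blocks appear in N-terminal order.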
Figure 2. Sections of the PRO web page for the gene-level protein class BUB1B (PR:000004855) and its subclasses. (A) Interactive Sequence View displaying a multiple sequence alignment of BUB1B proteoforms across organisms, with experimentally determined phosphorylation sites highlighted in pink and potential phosphorylation sites highlighted in gray. Alignments are generated on-the-fly using MUSCLE (22) with default parameters. (B) Portion of the Protein Forms table displaying information about the gene-level BUB1B term (PR:000004855) and several of its subclasses. (C) BUB1B proteoforms found in at least one complex.
Figure 3. Section of the PRO web page for the organism-complex term (PR:000035398) for the frog BUB1B:APC:EB1 complex, showing the table of complex subunits.
Figure 4. PRO SPARQL query. (A) A sample query designed to retrieve all the subclasses of PR:000000006, showing the term URI, label, and PRO category. (B) A subset of the query's 85 results. Note that for brevity the XML Schema datatype suffix for labels and categories is not shown (the full results would have ‘^^http://www.w3.org/2001/XMLSchema#string’ appended to each). (C) The hierarchy for the subset shown in (B).
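The query text from panel (A) is not reproduced here, so the following Python sketch only shows what a query of that kind might look like. It is not the authors' query. It assumes that PRO term URIs follow the usual OBO PURL pattern (`PR:000000006` becomes `http://purl.obolibrary.org/obo/PR_000000006`), that subclasses are linked with `rdfs:subClassOf`, and that "all the subclasses" means transitive ones. The PRO category column is left out because the predicate that carries it is not given in this text. No endpoint address is assumed. The helper names are hypothetical.

```python
OBO_PURL = "http://purl.obolibrary.org/obo/"
XSD_STRING_SUFFIX = "^^http://www.w3.org/2001/XMLSchema#string"


def curie_to_uri(curie: str) -> str:
    """Turn 'PR:000000006' into an OBO-style PURL (assumed convention)."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_PURL}{prefix}_{local_id}"


def build_subclass_query(curie: str) -> str:
    """Build a SPARQL 1.1 query for transitive subclasses and their labels."""
    uri = curie_to_uri(curie)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?term ?label\n"
        "WHERE {\n"
        # '+' is a SPARQL 1.1 property path: one or more subClassOf steps.
        f"  ?term rdfs:subClassOf+ <{uri}> .\n"
        "  ?term rdfs:label ?label .\n"
        "}\n"
        "ORDER BY ?term"
    )


def strip_string_datatype(value: str) -> str:
    """Drop the xsd:string datatype suffix from a result value, if present."""
    if value.endswith(XSD_STRING_SUFFIX):
        return value[: -len(XSD_STRING_SUFFIX)]
    return value


if __name__ == "__main__":
    print(build_subclass_query("PR:000000006"))
```

The `strip_string_datatype` helper mirrors the abbreviation used in panel (B), where the datatype suffix is omitted from labels and categories for brevity.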