Chuming Chen1, Hongzhan Huang2, Karen E Ross3, Julie E Cowart2, Cecilia N Arighi2, Cathy H Wu2,3, Darren A Natale3.
Abstract
The Protein Ontology (PRO) provides an ontological representation of protein-related entities, ranging from protein families to proteoforms to complexes. Protein Ontology Linked Open Data (LOD) exposes, shares, and connects knowledge about protein-related entities on the Semantic Web using the Resource Description Framework (RDF), thus enabling integration with other Linked Open Data for biological knowledge discovery. For example, proteins (or variants thereof) can be retrieved on the basis of specific disease associations. As a community resource, we strive to follow the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles, disseminate regular updates of our data, support multiple methods for accessing, querying, and downloading data in various formats, and provide documentation for both scientists and programmers. PRO Linked Open Data can be browsed via a faceted browser interface and queried using SPARQL via YASGUI. RDF data dumps are also available for download. Additionally, we developed RESTful APIs to support programmatic data access. We also provide a W3C HCLS specification-compliant metadata description for our data. The PRO Linked Open Data is available at https://lod.proconsortium.org/.
Year: 2020 PMID: 33046717 PMCID: PMC7550340 DOI: 10.1038/s41597-020-00679-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 A PRO RDF data model (PR:000046294). Ellipse and circle shapes are RDF nodes. Rectangle shapes are RDF literals. Directed edges are RDF properties. Circle shapes represent anonymous classes or blank nodes. ‘AKT1’, used here for brevity, is the gene for ‘RAC-alpha serine/threonine-protein kinase’.
Example PRO terms in each category.
| Category | Specificity | Term | Name |
|---|---|---|---|
| Family | organism-neutral | PR:000000027 | smad protein |
| | organism-specific | PR:000044507 | 14-3-3 protein (human) |
| Gene | organism-neutral | PR:000000364 | smad2 |
| | organism-specific | PR:000022736 | fumarate hydratase class II |
| SeqGroup | organism-neutral | PR:000050216 | receptor-type tyrosine-protein phosphatase C isoform CD45R |
| | organism-specific | PR:Q9ULB1 | neurexin-1-alpha (human) |
| Sequence | organism-neutral | PR:000000048 | TGF-beta receptor type-2 isoform RII-1 |
| | organism-specific | PR:Q68FF6-1 | ARF GTPase-activating protein GIT1 isoform 1 (mouse) |
| Modification | organism-neutral | PR:000049939 | RAC-alpha serine/threonine-protein kinase phosphorylated 3 |
| | organism-specific | PR:000046294 | RAC-alpha serine/threonine-protein kinase phosphorylated 3 (human) |
| Complex | organism-neutral | PR:000027291 | phosphorylase kinase complex PHKL |
| | organism-specific | PR:000036137 | lipopolysaccharide receptor complex 3; endosome membrane (human) |
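The term identifiers in this table are written as CURIEs (`PR:...`), while the RDF data refers to terms by IRI (the FAIRness table, for instance, lists `obo:PR_Q7TMZ5`). The sketch below converts between the two forms, assuming the usual OBO PURL convention `http://purl.obolibrary.org/obo/PR_<local id>`. The function names and the accepted ID pattern are illustrative choices, and the treatment of isoform suffixes such as `-1` is an assumption.

```python
import re

# Assumption: PRO terms follow the standard OBO PURL pattern.
OBO_BASE = "http://purl.obolibrary.org/obo/"

# "PR:" followed by a numeric ID or a UniProtKB-style accession,
# optionally with an isoform suffix such as "-1".
_CURIE_RE = re.compile(r"^PR:([A-Za-z0-9]+(?:-\d+)?)$")


def curie_to_iri(curie: str) -> str:
    """Turn a PRO CURIE such as 'PR:000046294' into an OBO-style IRI."""
    match = _CURIE_RE.match(curie.strip())
    if match is None:
        raise ValueError(f"not a PRO identifier: {curie!r}")
    return f"{OBO_BASE}PR_{match.group(1)}"


def iri_to_curie(iri: str) -> str:
    """Inverse of curie_to_iri for IRIs under the OBO PR_ namespace."""
    prefix = OBO_BASE + "PR_"
    if not iri.startswith(prefix):
        raise ValueError(f"not a PRO IRI: {iri!r}")
    return "PR:" + iri[len(prefix):]


if __name__ == "__main__":
    for term in ("PR:000046294", "PR:Q9ULB1", "PR:Q68FF6-1"):
        print(term, "->", curie_to_iri(term))
```

Keeping the conversion in one place avoids scattering string manipulation across query-building code.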
Fig. 2 Knowledge graph of exemplary query result of federated SPARQL query 1 (Get all human genes in PRO whose UniProtKB counterpart has variants with loss of function implicated in disease). Ellipse shapes are RDF nodes. Rectangle shapes are RDF literals. Directed edges are RDF properties.
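To illustrate the general shape of such a federated query, the sketch below assembles SPARQL text that joins PRO terms to UniProtKB entries and delegates the variant lookup to a remote endpoint through a `SERVICE` clause. It is not the authors' query 1. It assumes that the `skos:exactMatch` links point from PRO terms to UniProtKB protein IRIs, it matches variant annotations by free-text comment only, and it omits the restriction to human gene-level terms. The function name and parameters are hypothetical.

```python
# Public UniProt SPARQL endpoint, used as the remote SERVICE.
UNIPROT_SPARQL = "https://sparql.uniprot.org/sparql"


def build_federated_query(phrase: str = "loss of function", limit: int = 10) -> str:
    """Return SPARQL text for a PRO-to-UniProt federated lookup (sketch only)."""
    # Lower-case the phrase and escape characters that would break the literal.
    safe = phrase.lower().replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX up: <http://purl.uniprot.org/core/>",
        "SELECT DISTINCT ?pro ?protein ?comment",
        "WHERE {",
        "  # Local part (assumed direction): PRO term linked to a UniProtKB entry.",
        "  ?pro skos:exactMatch ?protein .",
        "  # Remote part: evaluated by the UniProt endpoint.",
        "  SERVICE <" + UNIPROT_SPARQL + "> {",
        "    ?protein up:annotation ?ann .",
        "    ?ann a up:Natural_Variant_Annotation ;",
        "         rdfs:comment ?comment .",
        '    FILTER(CONTAINS(LCASE(?comment), "' + safe + '"))',
        "  }",
        "}",
        "LIMIT " + str(int(limit)),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_federated_query(limit=5))
```

The resulting text could be pasted into a SPARQL 1.1 query interface. Whether it returns rows depends on the actual predicates and link direction used in the PRO named graphs.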
Fig. 3 Knowledge graph of exemplary query result of federated SPARQL query 2 (Find variants in UniProt or DisGeNET for AlzForum PRO terms). Ellipse and circle shapes are RDF nodes. Rectangle shapes are RDF literals. Directed edges are RDF properties. Circle shapes represent anonymous classes or blank nodes.
The formal FAIRness assessment results for Protein Ontology Linked Open Dataset.
| Metric | Requirement | Resource | Resource data/content |
|---|---|---|---|
| F1A | IRI for a registered identifier scheme for your resource’s IRI | PURL schema | PURL schema |
| F1B | IRI to a document describing the persistency policy for the identifier of this data | | |
| F2 | IRI for machine-readable metadata for the resource | | |
| | IRI to file format for this metadata | | |
| F3 | Is the resource identifier specified in the metadata? | Yes | Yes |
| F4 | URL to a search engine indexing your resource | | |
| | Search query/terms | “Protein Ontology” -> First hit | “PR_000025934” -> First hit |
| A1.1 | URL to the description of the protocol | HTTP | HTTP |
| | | https://ecciki/Hypertext_Transfer_Protocol | |
| | Is the protocol open? | Yes | Yes |
| | Is the protocol (royalty) free? | Yes | Yes |
| A1.2 | Is authorization required to access the content of your resource? | No | No |
| A2 | URL to metadata longevity plan | Provided at the dataset level | |
| I1 | URL to a specification language | RDFS and OWL ontology | RDFS and OWL ontology |
| I2 | Maximum 3 IRIs for vocabularies used within the (meta)data | dcterms:title “PRO Linked Open Data”@en | |
| | | dcat:keyword “Protein Ontology”^^xsd:string, “Linked Open Data”^^xsd:string | |
| | | void:linkPredicate skos:closeMatch | |
| I3 | URL to a LinkSet ( | Provided at the dataset level | |
| R1.1 | URL to license/terms of use for the resource | | |
| R1.2 | Maximum 3 IRIs used to describe the provenance of the resource | dcterms:accrualPeriodicity freq:quarterly | |
| | | dcterms:publisher [foaf:page] | |
| | | pav:hasCurrentVersion:prolod59_0 | |
| | Maximum 3 IRIs used to describe domain information | oboInOwl:hasSynonymType pr:PRO-short-label | |
| | | obo:PR_Q7TMZ5 | |
| | | oboInOwl:hasSynonymType pr:PRO-short-label | |
| R1.3 | IRI that represents certification from a recognized authority | Provided at the dataset level | |
Fig. 4 Virtuoso faceted browser query interface and result table view.
Fig. 5 PRO LOD SPARQL GUI. It provides users with a portal to query Protein Ontology Linked Open Data using the SPARQL 1.1 standard, as well as a comprehensive set of example queries.
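Outside the GUI, a SPARQL 1.1 endpoint can generally be queried over HTTP by passing the query text in a `query` parameter and requesting the standard JSON results format. The sketch below prepares such a request and flattens a results document into rows. The endpoint URL is a hypothetical placeholder (the actual address is listed at https://lod.proconsortium.org/), the helper names are illustrative, and the sample payload is hand-written to show the standard format rather than a real server response.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder; substitute the published PRO LOD SPARQL endpoint.
ENDPOINT = "https://example.org/sparql"

QUERY = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "SELECT ?term ?label WHERE { ?term rdfs:label ?label } LIMIT 5"
)


def make_sparql_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Prepare a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload: str) -> list:
    """Flatten a SPARQL JSON results document into a list of dicts."""
    doc = json.loads(payload)
    names = doc["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample in the W3C SPARQL JSON results format (not a server reply).
SAMPLE = json.dumps({
    "head": {"vars": ["term", "label"]},
    "results": {"bindings": [{
        "term": {"type": "uri", "value": "http://purl.obolibrary.org/obo/PR_000000364"},
        "label": {"type": "literal", "value": "smad2"},
    }]},
})

REQUEST = make_sparql_request(QUERY)
# To send it: urllib.request.urlopen(REQUEST).read().decode("utf-8")
ROWS = rows_from_results(SAMPLE)
```

Separating request construction from result parsing lets both be checked without network access.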
Fig. 6 API documentation for Protein Ontology Linked Open Data. The Swagger™ API generates an interactive webpage where users can ‘try out’ the service with real queries. Results are returned in the ‘Response Body’ in the user-selected response format, either JSON (illustrated) or XML.
Currently supported Protein Ontology RESTful API endpoints.
| API Operation Group | API Access Path* | Description |
|---|---|---|
| PRO Terms | /pros | Search PRO terms. |
| | /pros/{proIds} | Return PRO terms by IDs. |
| Proteoform Terms | /proforms/modification | Returns a list of modified protein forms. |
| | /proforms/modification/phosphorylated | Returns a list of phosphorylated protein forms. |
| | /proforms/modification/methylated | Returns a list of methylated protein forms. |
| | /proforms/modification/acetylated | Returns a list of acetylated protein forms. |
| | /proforms/modification/ubiquitinated | Returns a list of ubiquitinated protein forms. |
| | /proforms/modification/glycosylated | Returns a list of glycosylated protein forms. |
| | /proforms/orthoisoform | Returns a list of ortho-isoform protein forms. |
| | /proforms/orthomodform | Returns a list of ortho-modform protein forms. |
| | /proforms/sequence | Returns a list of sequence level protein forms. |
| | /proforms/organism-sequence | Returns a list of organism-sequence level protein forms. |
| Protein Evolutionary Terms | /proevos/family | Returns a list of family level protein terms. |
| | /proevos/gene | Returns a list of gene level protein terms. |
| | /proevos/organism-gene | Returns a list of organism-gene level protein terms. |
| Protein Complex Terms | /procomps/organism | Returns a list of organism specific protein complex terms. |
| | /procomps | Returns a list of organism non-specific protein complex terms. |
| Database Cross-references | /dbxrefs/EcoCyc_ID | Returns a list of PRO terms with EcoCyc ID as cross-reference. |
| | /dbxrefs/HGNC_ID | Returns a list of PRO terms with HGNC ID as cross-reference. |
| | /dbxrefs/MGI_ID | Returns a list of PRO terms with MGI ID as cross-reference. |
| | /dbxrefs/Ontology_ID | Returns a list of PRO terms with Ontology ID as cross-reference. |
| | /dbxrefs/PANTHER_ID | Returns a list of PRO terms with PANTHER ID as cross-reference. |
| | /dbxrefs/PIRSF_ID | Returns a list of PRO terms with PIRSF ID as cross-reference. |
| | /dbxrefs/PMID | Returns a list of PRO terms with PMID as cross-reference. |
| | /dbxrefs/Reactome_ID | Returns a list of PRO terms with Reactome ID as cross-reference. |
| | /dbxrefs/NCBITaxon_ID | Returns a list of PRO terms with NCBI Taxon ID as cross-reference. |
| | /dbxrefs/UniProtKB_ID | Returns a list of PRO terms with UniProtKB ID as cross-reference. |
| PRO Annotation File | /paf/{proIds} | Returns annotations for the given PRO ID(s). |
| OBO File | /obo/{proIds} | Returns PRO term in OBO format for the given PRO ID(s). |
| DAG | /dag/parent/{proIds} | Returns direct parent PRO terms by the given PRO ID(s). |
| | /dag/ancestor/{proIds} | Returns direct and indirect parent PRO terms by the given PRO ID(s). |
| | /dag/children/{proIds} | Returns direct children PRO terms by the given PRO ID(s). |
| | /dag/descendant/{proIds} | Returns direct and indirect children PRO terms by the given PRO ID(s). |
| | /dag/hierarchy/{proId} | Returns hierarchy of PRO terms by the given PRO ID. |
*Each path is appended to the base URL “https://research.bioinformatics.udel.edu/PRO_API/V1”.
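As a minimal sketch of how these access paths might be used programmatically, the helper below fills a path template from the table and prepares a request for JSON or XML output. Several details are assumptions rather than documented behaviour: that multiple IDs are comma-separated, that IDs are passed in `PR:...` form, and that the service selects the response format from the `Accept` header. The function names are hypothetical, and the interactive API documentation should be consulted for the actual parameters.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://research.bioinformatics.udel.edu/PRO_API/V1"


def endpoint_url(path_template: str, pro_ids=()) -> str:
    """Fill {proIds}/{proId} in an access path and prepend the base URL."""
    needs_ids = "{proIds}" in path_template or "{proId}" in path_template
    if needs_ids and not pro_ids:
        raise ValueError("this access path requires at least one PRO ID")
    # Assumption: several IDs are passed as one comma-separated path segment.
    encoded = urllib.parse.quote(",".join(pro_ids), safe=":,")
    path = path_template.replace("{proIds}", encoded).replace("{proId}", encoded)
    return BASE_URL + path


def make_request(url: str, fmt: str = "json") -> urllib.request.Request:
    """Prepare a GET request; the Accept-based format choice is an assumption."""
    accept = {"json": "application/json", "xml": "application/xml"}[fmt]
    return urllib.request.Request(url, headers={"Accept": accept})


if __name__ == "__main__":
    url = endpoint_url("/dag/parent/{proIds}", ["PR:000046294"])
    print(url)
    # To send it: urllib.request.urlopen(make_request(url)).read()
```

Paths without a placeholder, such as `/proevos/family`, pass through unchanged apart from the base URL.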
RDF centric statistics for Protein Ontology Linked Open Dataset (release 61.0).
| Dataset Named Graph | Triples | Classes | Entities | Subjects | Predicates | Objects |
|---|---|---|---|---|---|---|
| pro | 118,587,202 | 8 | 1,996,180 | 2,285,489 | 48 | 31,298,222 |
| paf | 91,407 | 4 | 8,313 | 19,733 | 22 | 31,301 |
| pro-ensembl-closeMatch-linkset | 48 | 0 | 0 | 24 | 2 | 24 |
| pro-ensemblbacteria-closeMatch-linkset | 1,758 | 0 | 0 | 493 | 2 | 879 |
| pro-hgnc-closeMatch-linkset | 36,542 | 0 | 0 | 18,268 | 2 | 18,271 |
| pro-mgi-closeMatch-linkset | 31,024 | 0 | 0 | 15,509 | 2 | 15,512 |
| pro-ncbigene-closeMatch-linkset | 5,310 | 0 | 0 | 1,693 | 2 | 2,655 |
| pro-reactome-closeMatch-linkset | 29,042 | 0 | 0 | 10,169 | 2 | 14,521 |
| pro-reactome-exactMatch-linkset | 43,998 | 0 | 0 | 11,220 | 3 | 14,666 |
| pro-rgd-closeMatch-linkset | 14,690 | 0 | 0 | 7,343 | 2 | 7,345 |
| pro-sgd-closeMatch-linkset | 2,522 | 0 | 0 | 1,261 | 2 | 1,261 |
| pro-uniprotkb-closeMatch-linkset | 274,308 | 0 | 0 | 54,960 | 2 | 133,016 |
| pro-uniprotkb-exactMatch-linkset | 527,418 | 0 | 0 | 175,793 | 3 | 169,975 |
| pro-uniprotkbvar-closeMatch-linkset | 938 | 0 | 0 | 41 | 2 | 469 |
| pro-uniprotkbvar-exactMatch-linkset | 1,407 | 0 | 0 | 469 | 3 | 469 |
| pro-wormbase-closeMatch-linkset | 3,570 | 0 | 0 | 1,785 | 2 | 1,785 |
| void | 2,326 | 7 | 74 | 292 | 63 | 606 |
SPARQL queries used to calculate statistics can be found at https://code.google.com/archive/p/void-impl/wikis/SPARQLQueriesForStatistics.wiki.
The triples in “exactMatch-linkset” use “skos:exactMatch” as linkPredicate. The triples in “closeMatch-linkset” use “skos:closeMatch” as linkPredicate.
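To make the column meanings concrete, the sketch below computes the same kinds of counts over a small in-memory list of triples. The definitions are assumptions modelled on common VoID-style statistics: classes are distinct objects of `rdf:type`, entities are distinct subjects that have an `rdf:type`, and objects count every distinct object value, literals included. The published SPARQL queries may define these differently, and the toy triples are invented for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def graph_statistics(triples) -> dict:
    """Compute VoID-style counts for an iterable of (subject, predicate, object)."""
    unique = set(triples)  # a graph is a set, so duplicates count once
    typed = [(s, o) for s, p, o in unique if p == RDF_TYPE]
    return {
        "triples": len(unique),
        "classes": len({o for _, o in typed}),
        "entities": len({s for s, _ in typed}),
        "subjects": len({s for s, _, _ in unique}),
        "predicates": len({p for _, p, _ in unique}),
        "objects": len({o for _, _, o in unique}),
    }


# Invented toy graph; the identifiers are placeholders, not PRO data.
TOY = [
    ("ex:a", RDF_TYPE, "ex:Protein"),
    ("ex:b", RDF_TYPE, "ex:Protein"),
    ("ex:a", "skos:closeMatch", "ex:x"),
    ("ex:b", "skos:closeMatch", "ex:y"),
    ("ex:a", "rdfs:label", "alpha"),
]

if __name__ == "__main__":
    print(graph_statistics(TOY))
```

Under these definitions, a graph that contains no `rdf:type` triples reports zero classes and zero entities.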