Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Elvira Mitraka, Julia Turner, Tim Putman, Justin Leong, Chinmay Naik, Paul Pavlidis, Lynn Schriml, Benjamin M Good, Andrew I Su.
Abstract
Open biological data are distributed over many resources, making them challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database that can serve this purpose and also provides tight integration with Wikipedia. To improve the state of biological data and to facilitate data management and dissemination, we imported all human and mouse genes, and all human and mouse proteins, into Wikidata. In total, 59,721 human genes and 73,355 mouse genes have been imported from NCBI, and 27,306 human proteins and 16,728 mouse proteins have been imported from the Swiss-Prot subset of UniProt. As Wikidata is open and can be edited by anybody, our corpus of imported data serves as the starting point for integration of further data by scientists, the Wikidata community and citizen scientists alike. The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata with the data integrated above. This enables immediate updates of the Gene Wiki infoboxes as soon as the data in Wikidata are modified. Although Gene Wiki pages currently exist only on the English language version of Wikipedia, the multilingual nature of Wikidata allows the imported data to be used in all 280 language versions of Wikipedia. Apart from the Gene Wiki infobox use case, a SPARQL endpoint and export functionality for several standard formats (e.g. JSON, XML) enable use of the data by scientists. In summary, we created a fully open and extensible data resource for human and mouse molecular biology and biochemistry data. This resource enriches all the Wikipedias with structured information and serves as a new linking hub for the biological semantic web. Database URL: https://www.wikidata.org/
Year: 2016 PMID: 26989148 PMCID: PMC4795929 DOI: 10.1093/database/baw015
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Wikidata item and data organization. Wikidata items can be added or edited manually by anyone. A Wikidata item consists of: (1) a language-specific label, (2) its unique identifier, (3) language-specific aliases, (4) interwiki links to the different language Wikipedia articles or other Wikimedia projects and (5) a list of statements. For this specific example, the human protein Reelin was used (https://www.wikidata.org/wiki/Q13569356).
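The five parts of an item listed in the caption can be pictured as a small record type. The following is a minimal sketch, not the Wikidata data model itself: the class and field names are hypothetical, the alias is a placeholder, and the English sitelink title is assumed for illustration. Only the identifier Q13569356 and the 'subclass of' (P279) value protein (Q8054) come from this record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Statement:
    """One property/value pair, e.g. P279 -> Q8054 (hypothetical helper type)."""
    property_id: str
    value: str


@dataclass
class WikidataItem:
    """Illustrative container mirroring the five parts named in the caption."""
    qid: str                                                      # (2) unique identifier
    labels: Dict[str, str] = field(default_factory=dict)          # (1) label per language
    aliases: Dict[str, List[str]] = field(default_factory=dict)   # (3) aliases per language
    sitelinks: Dict[str, str] = field(default_factory=dict)       # (4) interwiki links
    statements: List[Statement] = field(default_factory=list)     # (5) statements

    def values_for(self, property_id: str) -> List[str]:
        """Return all statement values recorded for one property."""
        return [s.value for s in self.statements if s.property_id == property_id]


reelin = WikidataItem(
    qid="Q13569356",
    labels={"en": "Reelin"},
    aliases={"en": ["placeholder alias"]},   # placeholder, not real data
    sitelinks={"enwiki": "Reelin"},          # assumed article title
    statements=[Statement("P279", "Q8054")], # subclass of: protein
)
```

Calling `reelin.values_for("P279")` returns the list of 'subclass of' values stored on the item.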
Figure 2. Gene Wiki data model in Wikidata. Each entity (human gene, human protein, mouse gene, mouse protein) is represented as a separate Wikidata item. Arrows represent direct links between Wikidata statements. The English language interwiki link on the human gene item points to the corresponding Gene Wiki article on the English Wikipedia.
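One way to read this data model is as a small graph of four items connected by the 'encodes' (P688), 'encoded by' (P702) and 'ortholog' (P684) properties. The sketch below assumes that every link has a reciprocal statement on the target item, which the caption does not state explicitly. The item keys are hypothetical stand-ins for real Wikidata identifiers, and `missing_reciprocals` is an illustrative helper, not part of any Wikidata tooling.

```python
ENCODES = "P688"
ENCODED_BY = "P702"
ORTHOLOG = "P684"

# Assumed inverse of each property (ortholog is treated as symmetric).
INVERSE = {ENCODES: ENCODED_BY, ENCODED_BY: ENCODES, ORTHOLOG: ORTHOLOG}

# Hypothetical item keys; real items would carry Q identifiers.
links = {
    "human_gene": {ENCODES: ["human_protein"], ORTHOLOG: ["mouse_gene"]},
    "human_protein": {ENCODED_BY: ["human_gene"]},
    "mouse_gene": {ENCODES: ["mouse_protein"], ORTHOLOG: ["human_gene"]},
    "mouse_protein": {ENCODED_BY: ["mouse_gene"]},
}


def missing_reciprocals(graph):
    """List (item, property, target) statements that would be needed
    to make every link in the graph point back to its source."""
    missing = []
    for source, statements in graph.items():
        for prop, targets in statements.items():
            inverse = INVERSE.get(prop)
            if inverse is None:
                continue  # property has no assumed inverse
            for target in targets:
                if source not in graph.get(target, {}).get(inverse, []):
                    missing.append((target, inverse, source))
    return missing
```

For the fully linked graph above the helper returns an empty list. Removing the 'encoded by' statement from the protein item would make it report the statement that needs to be restored.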
Overview of Homo sapiens and Mus musculus data imported to Wikidata
| Data source | Item count |
|---|---|
| Homo sapiens genes | 59 721 |
| Homo sapiens proteins | 27 306 |
| Mus musculus genes | 73 355 |
| Mus musculus proteins | 16 728 |
| Gene Ontology terms | 17 098 |
Wikidata properties used in this study
| Label | Target data type | Property ID | Description |
|---|---|---|---|
| Wikidata gene items | | | |
| Subclass of | Wikidata item | P279 | Defines the category to which this item belongs. Every gene item carries the value ‘gene’ (Q7187). Further subcategories are protein coding gene (Q20747295), ncRNA gene (Q27087), snRNA gene (Q284578), snoRNA gene (Q284416), rRNA gene (Q215980), tRNA gene (Q201448) and pseudogene (Q277338). |
| Entrez Gene ID | String | P351 | The NCBI gene ID as in annotation release 107 |
| Found in taxon | Wikidata item | P703 | The taxon, either Homo sapiens or Mus musculus |
| Ensembl Gene ID | String | P594 | Gene ID from the Ensembl database |
| Ensembl Transcript ID | String | P704 | Transcript IDs from the Ensembl database |
| Gene symbol | String | P353 | Human gene symbol according to HUGO Gene Nomenclature Committee |
| HGNC ID | String | P354 | HUGO Gene Nomenclature Committee ID |
| HomoloGene ID | String | P593 | Identifier for the HomoloGene database |
| NCBI RefSeq RNA ID | String | P639 | |
| Chromosome | Wikidata item | P1057 | Chromosome a gene is residing on |
| Ortholog | Wikidata item | P684 | Ortholog based on the Homologene database |
| Genomic start | String | P644 | Genomic start according to GRCh37 and GRCh38, sourced from NCBI |
| Genomic stop | String | P645 | Genomic stop according to GRCh37 and GRCh38, sourced from NCBI |
| Mouse Genome Informatics ID | String | P671 | Jackson lab mouse genome informatics database |
| encodes | Wikidata item | P688 | Protein item a gene encodes |
| Wikidata protein items | | | |
| Subclass of | Wikidata item | P279 | protein (Q8054) |
| UniProt ID | String | P352 | |
| PDB ID | String | P638 | Protein structure IDs from PDB.org |
| RefSeq protein ID | String | P637 | NCBI RefSeq Protein ID |
| encoded by | Wikidata item | P702 | Gene item a protein is encoded by |
| Ensembl protein ID | String | P705 | |
| EC number | String | P591 | Enzyme Commission number |
| Protein structure image | Wiki Commons Media File | P18 | Preferred protein structure image retrieved from PDB.org |
| Cell component | Wikidata item | P681 | Gene ontology term items for cell components |
| Biological process | Wikidata item | P682 | Gene ontology term items for biological processes |
| Molecular function | Wikidata item | P680 | Gene ontology term items for molecular function |
Column one contains the description as in Wikidata, column two the data type, column three the property number and column four a short description of the nature of the content.
Figure 4. An example SPARQL query, using the Wikidata SPARQL endpoint (query.wikidata.org). It retrieves all Wikidata (WD) items which are of subclass protein-coding gene (Q840604), which have a chromosomal start position (P644) according to human genome build GRCh38 and reside on human chromosome (P659) 9 (Q20966585) and a chromosomal end position (P645) also on chromosome 9. Furthermore, the region of interest is restricted to a chromosomal start position between 21 and 30 megabase pairs. Colors: Red indicates SPARQL commands, blue represents variable names, green represents URIs and brown are strings. Arrows point to the source code the description applies to.
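Since the figure itself is not reproduced here, the following Python sketch builds a query of the kind the caption describes. It is an approximation under stated assumptions, not the query from the figure. The function name and parameters are hypothetical, and the `wd:`, `wdt:`, `p:`, `ps:` and `pq:` prefixes follow the current conventions of the Wikidata query service, which may differ from those in the original figure. The identifiers Q840604, P659 and Q20966585 are taken from the caption and used as an opaque qualifier pair, so their exact roles should be checked against Wikidata before use.

```python
def build_region_query(gene_class="Q840604", qualifier_prop="P659",
                       qualifier_value="Q20966585", start_mbp=21, end_mbp=30):
    """Return a SPARQL query string for genes whose genomic start (P644)
    lies between start_mbp and end_mbp megabase pairs.

    Default identifiers are copied from the figure caption (assumption:
    they are still valid on Wikidata today).
    """
    start_bp = start_mbp * 1_000_000
    end_bp = end_mbp * 1_000_000
    # Genomic positions are stored as strings, so they are cast to
    # integers before the numeric comparison.
    return f"""SELECT ?gene ?start ?stop WHERE {{
  ?gene wdt:P279 wd:{gene_class} ;
        p:P644 ?startStatement ;
        p:P645 ?stopStatement .
  ?startStatement ps:P644 ?start ;
                  pq:{qualifier_prop} wd:{qualifier_value} .
  ?stopStatement ps:P645 ?stop ;
                 pq:{qualifier_prop} wd:{qualifier_value} .
  FILTER(xsd:integer(?start) >= {start_bp} && xsd:integer(?start) <= {end_bp})
}}"""


query = build_region_query()
```

The resulting string can be pasted into the query form at query.wikidata.org. Keeping the identifiers as parameters makes it easy to swap in corrected values if the ones from the caption no longer match.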