
GlyGen data model and processing workflow.

Robel Kahsay¹, Jeet Vora¹, Rahi Navelkar¹, Reza Mousavi¹, Brian C Fochtman¹, Xavier Holmes¹, Nagarajan Pattabiraman¹, Rene Ranzinger², Rupali Mahadik², Tatiana Williamson², Sujeet Kulkarni², Gaurav Agarwal², Maria Martin³, Preethi Vasudev³, Leyla Garcia⁴, Nathan Edwards⁵, Wenjin Zhang⁵, Darren A Natale⁵, Karen Ross⁵, Kiyoko F Aoki-Kinoshita⁶, Matthew P Campbell⁷, William S York², Raja Mazumder¹.

Abstract

SUMMARY: Glycoinformatics plays a major role in glycobiology research, and the development of a comprehensive glycoinformatics knowledgebase is critical. This application note describes the GlyGen data model, processing workflow and the data access interfaces featuring programmatic use case example queries based on specific biological questions. The GlyGen project is a data integration, harmonization and dissemination project for carbohydrate and glycoconjugate-related data retrieved from multiple international data sources including UniProtKB, GlyTouCan, UniCarbKB and other key resources.
AVAILABILITY AND IMPLEMENTATION: The GlyGen web portal is freely available at https://glygen.org. The data portal, web services, SPARQL endpoint and GitHub repository are also freely available at https://data.glygen.org, https://api.glygen.org, https://sparql.glygen.org and https://github.com/glygener, respectively. All code is released under the GNU General Public License version 3 (GNU GPLv3) and is available on GitHub (https://github.com/glygener). The datasets are made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2020 PMID: 32324859 PMCID: PMC7320628 DOI: 10.1093/bioinformatics/btaa238

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

This application note introduces the GlyGen data-processing workflow used to build the backend for the GlyGen (York ) knowledgebase. The knowledgebase includes detailed information on the molecular, biophysical and functional properties of glycans, genes and proteins organized in pathways and ontologies, as well as a rapidly growing body of biological big data related to mutation and expression. All data integrated in the GlyGen project are publicly available in standard formats supported by NCBI (Sayers ) and EMBL-EBI (Cook ) to promote standardization and sharing of data within the broader glycomics community. GlyGen is a five-star linked open data compliant knowledgebase and a registered member of FAIRsharing.org fulfilling BioDBcore requirements (https://fairsharing.org/biodbcore-001375/).

2 Data integration workflow

The framework used to integrate GlyGen data starts by collecting glycan, protein and glycoprotein datasets from major data resources and data generators. The collected heterogeneous datasets are processed following the workflow shown in Figure 1.

Fig. 1.

GlyGen data-processing workflow. Data are retrieved from various resources including UniProtKB, GlyTouCan, UniCarbKB, RefSeq and other key resources, followed by extraction and filtering based on relevance to glycobiology. Extracted data are harmonized against various standard ontologies and then integrated. The resulting datasets are then ingested into a MongoDB docstore and a Virtuoso triplestore using the GlyGen data model.

2.1 Data sources

In GlyGen, GlyTouCan (Tiemeyer ) and PubChem (Kim ) provide glycan-related data, whereas protein-related data are collected from resources such as UniProtKB (UniProt Consortium, 2019), NCBI Reference Sequence (RefSeq) (O’Leary ), BioMuta (Dingerdissen ), BioXpress (Dingerdissen ), Mouse Genome Informatics (Bult ), Orthologous Matrix (Altenhoff ), Disease Ontology (Kibbe ), Genomics England PanelApp (Martin ) and the Monarch Initiative (Mungall ). Finally, glycoprotein-related data are integrated from UniCarbKB (Campbell ), PDB (Berman ) and UniProtKB. In addition to these major resources, other relevant datasets are collected from various research laboratories.

2.2 Dataset standardization, integration and quality control

Data downloaded from various resources are versioned and stored in the GlyGen backend server. The GlyGen knowledgebase maintains strict protein (UniProtKB canonical accessions) and glycan (GlyTouCan accessions) lists, which are used as the protein and glycan primary keys, respectively. Each dataset is mapped to one of these primary keys, and any non-standard identifiers are mapped to their equivalent standard identifiers. Datasets pass through quality control and filtering steps, such as file format sanity checks, primary accession checks, residue or amino acid position accuracy checks and various other filtering steps outlined by subject matter experts. The processed dataset is then assigned a GlyGen dataset identifier, and a dataset BioCompute Object (BCO) (Alterovitz ) is created to provide detailed documentation of the data-processing workflow. These datasets can be searched, browsed and downloaded from the GlyGen data page (https://data.glygen.org). A detailed description of data preprocessing and normalization for glycan, protein and glycoprotein datasets is given in Supplementary Texts S1–S3. The GlyGen data page and the dataset sample view page are shown in Supplementary Figure S1a and b.
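The identifier mapping and quality-control steps described above can be illustrated with a minimal sketch. It is not the GlyGen implementation: the accession `X00001-1`, the alias `TOY_HUMAN`, the toy sequence and all function names are hypothetical, and only two of the checks named in the text (primary accession and residue position accuracy) are shown.

```python
from dataclasses import dataclass

# Hypothetical lookup tables standing in for the curated primary-key lists.
CANONICAL_PROTEINS = {"X00001-1": "MNASTK"}   # toy canonical accession -> toy sequence
ID_MAP = {"TOY_HUMAN": "X00001-1"}            # toy non-standard identifier -> canonical accession


@dataclass
class SiteRecord:
    protein_id: str   # identifier as supplied by the source dataset
    position: int     # 1-based residue position
    residue: str      # one-letter amino acid code reported at that position


def to_primary_key(identifier):
    """Return the canonical accession for an identifier, or None if it is unknown."""
    if identifier in CANONICAL_PROTEINS:
        return identifier
    return ID_MAP.get(identifier)


def passes_qc(record):
    """Check the primary accession and that the residue matches the sequence."""
    accession = to_primary_key(record.protein_id)
    if accession is None:
        return False                      # fails the primary accession check
    sequence = CANONICAL_PROTEINS[accession]
    if not 1 <= record.position <= len(sequence):
        return False                      # position lies outside the sequence
    return sequence[record.position - 1] == record.residue


def harmonize(records):
    """Keep records that pass QC, rewritten to use the canonical accession."""
    return [
        SiteRecord(to_primary_key(r.protein_id), r.position, r.residue)
        for r in records
        if passes_qc(r)
    ]


raw = [
    SiteRecord("TOY_HUMAN", 2, "N"),   # alias, correct residue
    SiteRecord("X00001-1", 4, "T"),    # wrong residue at position 4
    SiteRecord("UNKNOWN", 1, "M"),     # identifier cannot be mapped
    SiteRecord("X00001-1", 99, "N"),   # position out of range
]
clean = harmonize(raw)
```

Here only the first record survives, and it is stored under the canonical accession rather than the alias it arrived with.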

2.3 Biocompute Objects for GlyGen datasets

The dataset BCOs are created in conformance with the current BCO specification (1.3.0) (https://github.com/biocompute-objects/BCO_Specification/tree/master/schemas). Each dataset BCO is written from the perspective of the data integration process so that it captures all metadata related to the processing steps performed in the workflow. A dataset BCO constructed this way can serve as a ‘readme’ that provides precise details on how the dataset was integrated. Using the BCO standard facilitates granular tracking of metadata, especially provenance, which supports appropriate attribution, the license information that governs use of the dataset, workflow exchange between researchers and reproducibility of the dataset. These dataset BCOs are recorded in machine-readable JavaScript Object Notation (JSON) format and can be viewed and downloaded from the GlyGen data page (https://data.glygen.org).
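As a rough illustration of how a dataset BCO can act as a machine-readable ‘readme’, the sketch below parses a heavily abbreviated JSON record and lists its provenance and processing steps. The record is invented for illustration, and the field names (`provenance_domain`, `description_domain`, `pipeline_steps` and so on) are assumptions modeled on the BCO domain layout; they should be checked against the published schemas before use.

```python
import json

# Invented, heavily abbreviated record. Field names are ASSUMPTIONS modeled on
# the BCO domain layout and must be verified against the published schema.
BCO_TEXT = """
{
  "bco_id": "HYPOTHETICAL_DATASET_0001",
  "provenance_domain": {
    "name": "Toy glycosylation-site dataset",
    "license": "CC BY 4.0"
  },
  "description_domain": {
    "pipeline_steps": [
      {"step_number": 2, "name": "filter",
       "description": "Drop rows failing accession checks"},
      {"step_number": 1, "name": "download",
       "description": "Retrieve source file"}
    ]
  }
}
"""


def summarize_bco(bco):
    """Return the dataset name, license and ordered processing steps."""
    provenance = bco.get("provenance_domain", {})
    steps = bco.get("description_domain", {}).get("pipeline_steps", [])
    ordered = sorted(steps, key=lambda step: step["step_number"])
    return {
        "name": provenance.get("name"),
        "license": provenance.get("license"),
        "steps": [
            f'{s["step_number"]}. {s["name"]}: {s["description"]}' for s in ordered
        ],
    }


summary = summarize_bco(json.loads(BCO_TEXT))
for line in summary["steps"]:
    print(line)
```

Sorting by step number rather than trusting file order keeps the summary stable even if the steps were serialized out of sequence.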

2.4 GlyGen docstore and web services

As mentioned earlier, the GlyGen data integration workflow creates glycan, protein and glycoprotein centric datasets. These datasets are used to generate glycan, protein and glycoprotein centric JSON objects, which are stored in a MongoDB docstore. The GlyGen docstore serves as the backend for various GlyGen web services that are used by the GlyGen frontend as well as by external applications. The GlyGen web services (https://api.glygen.org), which are documented using the Swagger framework (https://swagger.io/), allow programmatic access to GlyGen data objects for glycans, proteins and glycoproteins. Some of these web services are generic and provide searching, listing and detailed record access functionalities for GlyGen data objects, while others are custom designed to respond to specific biological questions or use cases collected from the user community. The GlyGen API webpage is shown in Supplementary Figure S2.
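A detailed-record lookup against such a web service might be prepared as in the sketch below, which uses only the Python standard library. The base URL comes from the text; the path pattern, the HTTP method, the accession `G00000XX` and the function names are placeholders chosen for illustration, so the Swagger documentation remains the authority on the actual routes and parameters.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.glygen.org"  # base URL given in the text


def build_detail_request(record_type, accession):
    """Prepare (but do not send) a request for one detailed record.

    ASSUMPTION: the "/<record_type>/detail/<accession>" path and the GET
    method are placeholders, not confirmed routes of the GlyGen API.
    """
    path = "/{}/detail/{}".format(
        urllib.parse.quote(record_type, safe=""),
        urllib.parse.quote(accession, safe=""),
    )
    return urllib.request.Request(
        API_BASE + path,
        headers={"Accept": "application/json"},
        method="GET",
    )


def fetch_json(request, timeout=30):
    """Send a prepared request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# "G00000XX" is a made-up accession used only to show the call shape.
request = build_detail_request("glycan", "G00000XX")
print(request.get_method(), request.full_url)
```

Separating request construction from `fetch_json` lets the route be inspected or corrected against the documentation before any network call is made.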

3 GlyGen data model, triplestore and SPARQL endpoint

All data in the GlyGen project are also available in the Resource Description Framework (RDF) format using namespaces from various existing ontologies. The UniProt Core Ontology (Redaschi and UniProt Consortium, 2009) and the GlyGen Ontology are used to describe protein-centric data, whereas glycan-centric data are described using the GlycoRDF Ontology (Ranzinger ). The GlyGen Ontology, together with the Glycoconjugate Ontology (https://github.com/glycoinfo/GlycoCoO), provides the namespaces needed to represent glycoprotein data. A partial view of the GlyGen data model is given in Supplementary Figure S4, showing a glycoprotein entry linked to a protein sequence and one or many glycosylation sites. A glycosylation site consists of an exact or fuzzy position on a protein sequence that is known to have been glycosylated by a glycan or glycan set. An exact glycosylation site position is linked to the amino acid type that occupies it. The GlyGen knowledgebase uses a Virtuoso triplestore to store GlyGen triple data, and a SPARQL Protocol and RDF Query Language (SPARQL) endpoint (https://sparql.glygen.org) provides programmatic access to the triplestore. The webpage for the GlyGen SPARQL interface and the triplestore content statistics for release 1.5 are shown in Supplementary Figure S3 and Table S1, respectively.
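Programmatic access to a SPARQL endpoint generally follows the standard SPARQL protocol: the query is sent as a `query` parameter and results are requested as `application/sparql-results+json`. The sketch below prepares such a request and flattens a response in that standard format. It deliberately uses a generic triple pattern so that no GlyGen ontology terms are assumed; the exact query path on the endpoint host and the sample response are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

# Host from the text; the exact query path on this host is an ASSUMPTION.
SPARQL_ENDPOINT = "https://sparql.glygen.org"

# Generic pattern that makes no assumptions about ontology terms.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(query, endpoint=SPARQL_ENDPOINT):
    """Prepare a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


# Invented response in the standard SPARQL JSON results format.
SAMPLE_RESPONSE = json.loads("""
{
  "head": {"vars": ["s", "p", "o"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://example.org/site/1"},
     "p": {"type": "uri", "value": "http://example.org/position"},
     "o": {"type": "literal", "value": "2"}}
  ]}
}
""")

request = build_sparql_request(QUERY)
rows = rows_from_results(SAMPLE_RESPONSE)
```

Once a query returns useful triples, the generic pattern can be replaced with one that uses the ontology namespaces named above.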

4 Conclusion

This application note has introduced the data model and processing workflow used for building the backend for the GlyGen knowledgebase. In this processing and integration workflow, data are retrieved and extracted from a number of resources and then standardized and harmonized to create clean, high-quality datasets. The dataset creation process is fully documented in the form of metadata by creating BCOs. These datasets are further processed to create JSON objects and RDF triples that populate the MongoDB docstore and Virtuoso triplestore backend databases, respectively. The docstore is used by various GlyGen web services, while the triplestore is accessed through the GlyGen SPARQL endpoint.

17 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. GlyTouCan: an accessible glycan structure repository.

Authors: Michael Tiemeyer; Kazuhiro Aoki; James Paulson; Richard D Cummings; William S York; Niclas G Karlsson; Frederique Lisacek; Nicolle H Packer; Matthew P Campbell; Nobuyuki P Aoki; Akihiro Fujita; Masaaki Matsubara; Daisuke Shinmachi; Shinichiro Tsuchiya; Issaku Yamada; Michael Pierce; René Ranzinger; Hisashi Narimatsu; Kiyoko F Aoki-Kinoshita
Journal: Glycobiology Date: 2017-10-01 Impact factor: 4.313

3. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data.

Authors: Warren A Kibbe; Cesar Arze; Victor Felix; Elvira Mitraka; Evan Bolton; Gang Fu; Christopher J Mungall; Janos X Binder; James Malone; Drashtti Vasant; Helen Parkinson; Lynn M Schriml
Journal: Nucleic Acids Res Date: 2014-10-27 Impact factor: 16.971

4. BioMuta and BioXpress: mutation and expression knowledgebases for cancer biomarker discovery.

Authors: Hayley M Dingerdissen; John Torcivia-Rodriguez; Yu Hu; Ting-Chia Chang; Raja Mazumder; Robel Kahsay
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

5. Database resources of the National Center for Biotechnology Information.

Authors: Eric W Sayers; Richa Agarwala; Evan E Bolton; J Rodney Brister; Kathi Canese; Karen Clark; Ryan Connor; Nicolas Fiorini; Kathryn Funk; Timothy Hefferon; J Bradley Holmes; Sunghwan Kim; Avi Kimchi; Paul A Kitts; Stacy Lathrop; Zhiyong Lu; Thomas L Madden; Aron Marchler-Bauer; Lon Phan; Valerie A Schneider; Conrad L Schoch; Kim D Pruitt; James Ostell
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

6. Mouse Genome Database (MGD) 2019.

Authors: Carol J Bult; Judith A Blake; Cynthia L Smith; James A Kadin; Joel E Richardson
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

7. UniCarbKB: building a knowledge platform for glycoproteomics.

Authors: Matthew P Campbell; Robyn Peterson; Julien Mariethoz; Elisabeth Gasteiger; Yukie Akune; Kiyoko F Aoki-Kinoshita; Frederique Lisacek; Nicolle H Packer
Journal: Nucleic Acids Res Date: 2013-11-13 Impact factor: 16.971

8. PubChem Substance and Compound databases.

Authors: Sunghwan Kim; Paul A Thiessen; Evan E Bolton; Jie Chen; Gang Fu; Asta Gindulyte; Lianyi Han; Jane He; Siqian He; Benjamin A Shoemaker; Jiyao Wang; Bo Yu; Jian Zhang; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2015-09-22 Impact factor: 16.971

9. The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces.

Authors: Adrian M Altenhoff; Natasha M Glover; Clément-Marie Train; Klara Kaleb; Alex Warwick Vesztrocy; David Dylus; Tarcisio M de Farias; Karina Zile; Charles Stevenson; Jiao Long; Henning Redestig; Gaston H Gonnet; Christophe Dessimoz
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

10. Enabling precision medicine via standard communication of HTS provenance, analysis, and results.

Authors: Gil Alterovitz; Dennis Dean; Carole Goble; Michael R Crusoe; Stian Soiland-Reyes; Amanda Bell; Anais Hayes; Anita Suresh; Anjan Purkayastha; Charles H King; Dan Taylor; Elaine Johanson; Elaine E Thompson; Eric Donaldson; Hiroki Morizono; Hsinyi Tsang; Jeet K Vora; Jeremy Goecks; Jianchao Yao; Jonas S Almeida; Jonathon Keeney; KanakaDurga Addepalli; Konstantinos Krampis; Krista M Smith; Lydia Guo; Mark Walderhaug; Marco Schito; Matthew Ezewudo; Nuria Guimera; Paul Walsh; Robel Kahsay; Srikanth Gottipati; Timothy C Rodwell; Toby Bloom; Yuching Lai; Vahan Simonyan; Raja Mazumder
Journal: PLoS Biol Date: 2018-12-31 Impact factor: 8.029
