Glycowork: A Python package for glycan data science and machine learning.

Luc Thomès¹, Rebekka Burkholz², Daniel Bojar¹.

Abstract

While glycans are crucial for biological processes, existing analysis modalities make it difficult for researchers with limited computational background to incorporate these diverse carbohydrates into their workflows. Here, we present glycowork, an open-source Python package designed for glycan-related data science and machine learning by end users. Glycowork includes functions to, for instance, automatically annotate glycan motifs and analyze their distributions via heatmaps and statistical enrichment. We also provide visualization methods, routines to interact with stored databases, trained machine learning models and learned glycan representations. We envision that glycowork can extract further insights from glycan datasets and demonstrate this with workflows that analyze glycan motifs in various biological contexts. Glycowork can be freely accessed at https://github.com/BojarLab/glycowork/.

Entities: Chemical

Keywords: Python; data science; glycobioinformatics; glycobiology; machine learning

Substances: Polysaccharides

Year: 2021 PMID: 34192308 PMCID: PMC8600276 DOI: 10.1093/glycob/cwab067

Source DB: PubMed Journal: Glycobiology ISSN: 0959-6658 Impact factor: 4.313

Introduction

Discovering patterns in biological data requires (i) large datasets and (ii) data science, bioinformatics or machine learning. This combination has led to great advances in systems biology (Chuang et al. 2010; Zou and Laubichler 2018). Usually, there is limited overlap between the groups that collect data and those that develop algorithms. In mature systems biology fields, this gap is bridged by user-friendly software that enables experimental users to analyze their datasets for routine applications, such as the Bioconductor (Huber et al. 2015) or Biopython (Cock et al. 2009) platforms.

Glycobiology, the analysis of glycans in biological contexts (Varki 2017), has recently seen a surge in data gathering and algorithmic development. Moderately large datasets from glycomics (Cummings and Pierce 2014), glycan arrays (Oyelaran and Gildersleeve 2009) or lectin arrays (Ribeiro and Mahal 2013) can now be gathered fairly routinely, depending on the application. Many algorithms for the analysis of glycan-related data have recently been developed, such as subtree mining for glycan array data (Coff et al. 2020; Haab and Klamer 2020) or glycan-focused machine learning (Bojar et al. 2021; Burkholz et al. 2021). Furthermore, resources in glycobioinformatics have been centralized in the context of the GlySpace Alliance (Aoki-Kinoshita et al. 2020) for increased synergy. This includes a large-scale glycan repository in the form of GlyTouCan and portals, such as GlyCosmos or GlyGen, to various glycobioinformatics resources.

Yet while both factors necessary for effective analysis (data and algorithms) are present in glycobiology, most algorithm development efforts are inaccessible to the typical user, who might not be well-versed in computational workflows. While accessible graphical user interfaces have been developed for some applications (Grant et al. 2016; Huang et al. 2021), these approaches often lack the flexibility and throughput that many analyses require. Thus, with the exception of platforms such as glypy (Klein and Zaia 2019), which is geared more toward analyzing glycan-focused mass spectrometry, glycobioinformatics methods cannot be used with the same ease as bioinformatics procedures in other systems biology disciplines.

Therefore, we have developed glycowork, a computational framework designed to be accessible to end users with minimal computational background. While background functions that work with glycans as graphs are available to experts, we provide high-level wrapper functions for analyses that only require the input of glycans in a human-readable format, such as the IUPAC-condensed format. Glycowork is open-source (https://github.com/BojarLab/glycowork/) and we have prepared extensive documentation with example workflows (https://bojarlab.github.io/glycowork/). We envision that glycowork will advance glycobioinformatics by distilling insights from the increasing number of available glycan datasets.

Glycowork—Principles and applications

Glycowork is written in the Python programming language (version 3.6+) and uses pandas dataframes, lists of glycans or single glycans as inputs for its functions. We have structured glycowork into four modules: data loading and handling (glycan_data), sequence alignment (alignment), sequence processing and motif analysis (motif), and glycan-focused machine learning (ml) (Figure 1A).

Fig. 1

Structure of the glycowork package. (A) Modular structure of glycowork. Modules are depicted as boxes containing submodules. Dependencies between modules are indicated by connecting lines. (B) Workflow of the glycan_to_nxGraph function from the glycowork.motif.graph submodule. An example glycan in IUPAC-condensed notation is converted into a graph. The resulting edge, node and position lists are shown, with “T” indicating a terminal position and “I” indicating an internal position. (C) Workflow of the annotate_glycan function from the glycowork.motif.annotate submodule. Glycan graphs and graphs for known motifs are used to identify occurring motifs via subgraph isomorphism tests.

Functions in glycowork take glycans in the IUPAC-condensed nomenclature as input and convert them to graph objects (Burkholz et al. 2021) for further processing and analysis (Figure 1B). Crucially, this step removes the ambiguity of the IUPAC nomenclature, as the uniqueness of graphs can be tested. We decided against using other nomenclatures, such as GlycoCT (Herget et al. 2008) or WURCS (Tanaka et al. 2014), as we argue that combining a human-readable nomenclature (IUPAC-condensed) with a machine-readable nomenclature (glycan graphs) is necessary and sufficient for all relevant tasks in glycobioinformatics. Furthermore, working with glycans as graphs allowed us to leverage advances in graph theory that have accrued over decades of research, for instance via the NetworkX package (Hagberg et al. 2008), which can be applied to our glycan graphs without modification. Glycan graphs consist of the connectivities in a glycan (edge list), the contained monosaccharides and linkages (node labels) and the information on whether each of these is internal or terminal (position labels; Figure 1B). Glycowork is designed so that users with limited bioinformatics experience can work exclusively with glycans in IUPAC-condensed nomenclature, while all graph operations proceed in the background.
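To make the graph representation described above more concrete, the following is a minimal sketch of how an IUPAC-condensed string could be turned into node labels, an edge list and position labels. It is not the glycan_to_nxGraph implementation: the function name glycan_to_graph is hypothetical, linkages are written with plain letters (a/b), the reducing-end residue is assumed to be the last residue in the string, and a node is assumed to be terminal ("T") when it has at most one neighbor.

```python
import re


def glycan_to_graph(glycan):
    """Convert an IUPAC-condensed string into (labels, edges, positions).

    Hypothetical helper for illustration only, not part of glycowork.
    Monosaccharides and linkages both become nodes, as in the text above.
    """
    # Tokens are '[' or ']' (branch delimiters), '(linkage)' or a residue name.
    tokens = re.findall(r"\[|\]|\([^)]*\)|[^\[\]()]+", glycan)
    labels, edges = [], []
    branch_points = []  # stack of nodes to return to once a branch is finished
    parent = None       # node that the next token will be attached to

    # Read right to left so that the reducing-end residue becomes node 0.
    for token in reversed(tokens):
        if token == "]":
            branch_points.append(parent)   # a branch starts at the current node
        elif token == "[":
            parent = branch_points.pop()   # branch finished, back to branch point
        else:
            # Strip the parentheses from linkages; residues are kept as-is.
            label = token[1:-1] if token.startswith("(") else token
            labels.append(label)
            node = len(labels) - 1
            if parent is not None:
                edges.append((parent, node))
            parent = node

    # Assumption of this sketch: terminal means at most one neighbor.
    degree = [0] * len(labels)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    positions = ["T" if d <= 1 else "I" for d in degree]
    return labels, edges, positions


labels, edges, positions = glycan_to_graph("Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc")
for index, (label, position) in enumerate(zip(labels, positions)):
    print(index, label, position)
print(edges)
```

The sketch assumes a well-formed string without a dangling linkage at the reducing end; handling such cases, or ambiguous linkages, would need extra logic.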
We are confident that keeping graph operations in the background will make our open-source package more accessible. Many functions in glycowork leverage the power of graph theory, for instance by unambiguously annotating glycan motifs via subgraph isomorphism using the annotate_glycan function (Figure 1C). Graph isomorphism tests whether two graph objects describe the same graph (similar to how two IUPAC-condensed descriptions of a glycan can describe the same glycan), in the sense that monosaccharides have the same neighbors in both graphs. The same idea can be used to detect the presence of a subgraph, such as a motif, in a glycan graph. In glycowork, this test also uses the position labels, to ensure that motifs such as the O-glycan core motifs are only recognized at the reducing end. Glycowork comes equipped with 150 named motifs from the academic literature and can also analyze relevant disaccharide motifs, as further described below.

Glycowork contains continuously updated glycan datasets, such as glycan array data of influenza viruses or information about species-specific glycans. These could be used for developing algorithms or uncovering new insights into glycan properties. Additionally, learned representations of glycans from a deep learning model (Burkholz et al. 2021) are provided, allowing us to visualize clusters of similar glycans (Figure 2A). Furthermore, glycowork contains functions to cluster groups via heatmaps according to the presence and abundance of glycan motifs.
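The motif detection via subgraph isomorphism described above can be illustrated with a small backtracking search. This is a sketch under stated assumptions, not the annotate_glycan implementation: graphs are plain (labels, edges, positions) triples written out by hand, the helper names adjacency and contains_motif are hypothetical, and a motif position of None is treated as a wildcard. Because glycan graphs are trees, requiring every motif edge to map onto a glycan edge is enough for a connected motif.

```python
def adjacency(n_nodes, edges):
    """Build a neighbor lookup from an undirected edge list."""
    neighbors = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def contains_motif(glycan, motif):
    """Backtracking subgraph search on (labels, edges, positions) triples.

    A motif position of None is a wildcard (assumption of this sketch);
    "T" or "I" must equal the position label of the matched glycan node.
    """
    g_labels, g_edges, g_pos = glycan
    m_labels, m_edges, m_pos = motif
    g_adj = adjacency(len(g_labels), g_edges)
    m_adj = adjacency(len(m_labels), m_edges)

    def compatible(m, g):
        return m_labels[m] == g_labels[g] and m_pos[m] in (None, g_pos[g])

    def extend(mapping):
        if len(mapping) == len(m_labels):
            return True                  # every motif node has been placed
        m = len(mapping)                 # motif nodes are placed in index order
        for g in range(len(g_labels)):
            if g in mapping.values() or not compatible(m, g):
                continue
            # Motif neighbors that are already placed must be neighbors of g.
            if all(mapping[n] in g_adj[g] for n in m_adj[m] if n in mapping):
                mapping[m] = g
                if extend(mapping):
                    return True
                del mapping[m]           # undo and try the next candidate
        return False

    return extend({})


# Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc, written out as a graph by hand.
glycan = (
    ["GlcNAc", "a1-3", "Fuc", "b1-4", "Gal", "a2-3", "Neu5Ac"],
    [(0, 1), (1, 2), (0, 3), (3, 4), (4, 5), (5, 6)],
    ["I", "I", "T", "I", "I", "I", "T"],
)
lacnac_edges = [(0, 1), (1, 2)]
lacnac_anywhere = (["GlcNAc", "b1-4", "Gal"], lacnac_edges, [None, None, None])
lacnac_terminal_gal = (["GlcNAc", "b1-4", "Gal"], lacnac_edges, [None, None, "T"])

print(contains_motif(glycan, lacnac_anywhere))      # True
print(contains_motif(glycan, lacnac_terminal_gal))  # False: Gal is capped by Neu5Ac
```

The second query shows why position labels matter: the same three-node motif is found when positions are ignored but rejected when the galactose is required to be terminal. For real workloads, a tested matcher such as the one in NetworkX would be the more sensible choice than this hand-written search.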

Fig. 2

Example workflows from glycowork. (A) Investigating N-linked glycans in animals. The plot_embeddings function displays N-linked glycans, with colors corresponding to taxonomic phyla. The make_heatmap function displays glycan motif distributions for each phylum. (B) Analysis of rhamnose sequence neighborhood in bacteria. Bacterial glycans are colored based on the presence of rhamnose (Rha). Proportions of rhamnose and its variants (left) and their observed neighboring monosaccharides (right), as stacked bar graphs, are visualized via the characterize_monosaccharide function. (C) Glycan-binding specificities of influenza viruses. Measured glycan-binding of various influenza strains is represented as a heatmap. The get_pvals_motifs function displays, for each motif, a P-value and a corrected P-value. Shown are the top 10 motifs, with the full table available in Table SI. (D) Glycan classification using machine learning. The train_ml_model function constructs a model to discriminate between “animal” and “non-animal” glycans. The analyze_ml_model function displays important criteria for glycan classification. Full-scale heatmaps shown in A and C are found in Figures S1 and S2.

Besides these analyses, users can analyze the neighboring sequence of a monosaccharide in a specific taxonomic group such as bacteria (Figure 2B). Comparing the sequence context of different groups of interest might shed light on evolutionary or functional differences in their glycans. Here, we show this type of analysis with the sequence neighborhood of the monosaccharide rhamnose (Rha) in bacteria. This yields the observation that Rha, D-Rha and D-RhaNAc are all typically found in homogeneous sequence environments (i.e. connected to more Rha/D-Rha/D-RhaNAc, respectively). Glycan arrays are a common method to determine viral glycan-binding specificity (Smith and Cummings 2014).
Glycowork can analyze glycan array data by generating motif-based heatmaps, in which motifs are colored by their associated Z-score, in this case illustrating the split between Neu5Ac(α2–3)-binding avian influenza viruses and Neu5Ac(α2–6)-preferring mammalian influenza viruses (Figure 2C). This can be extended by identifying statistically significant binding motifs (Figure 2C), which points to the importance of sialic acid-containing motifs as reported previously (Viswanathan et al. 2010).

Another example of the functionalities of glycowork is glycan-focused machine learning. Given a list of glycans and corresponding labels, glycowork trains machine learning models with a single line of code. As an example, we trained a model to predict whether a glycan stems from an animal or a different organism and then analyzed which motifs were most predictive for this classification (Figure 2D). In this case, the presence of type-2 LacNAc (Gal(β1–4)GlcNAc) was most predictive for an animal glycan. Analogously, state-of-the-art deep learning models can be trained with only a few lines of code, for which we direct the user to the full documentation of glycowork.
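The idea of testing motifs for statistically significant binding, with a raw and a corrected P-value per motif, can be sketched as follows. The text does not state which test or correction get_pvals_motifs uses, so this example swaps in a one-sided permutation test and a Benjamini-Hochberg correction purely for illustration. The binding values are invented toy Z-scores, motif presence is approximated by a substring check on the IUPAC-condensed string, and all function names are hypothetical.

```python
import random
from statistics import mean


def permutation_pvalue(values, has_motif, n_perm=2000, seed=0):
    """One-sided permutation test: do glycans with the motif bind more strongly?"""
    rng = random.Random(seed)

    def mean_difference(flags):
        with_motif = [v for v, f in zip(values, flags) if f]
        without = [v for v, f in zip(values, flags) if not f]
        return mean(with_motif) - mean(without)

    observed = mean_difference(has_motif)
    flags = list(has_motif)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)               # reassign motif membership at random
        if mean_difference(flags) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)     # add-one estimate, never exactly zero


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):         # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (invented): Z-scored binding of one hypothetical strain per glycan.
binding = {
    "Neu5Ac(a2-6)Gal(b1-4)GlcNAc": 2.1,
    "Neu5Ac(a2-6)Gal(b1-4)Glc": 1.8,
    "Neu5Ac(a2-3)Gal(b1-4)GlcNAc": -0.3,
    "Neu5Ac(a2-3)Gal(b1-3)GalNAc": -0.5,
    "Gal(b1-4)GlcNAc": -0.6,
    "Fuc(a1-2)Gal(b1-4)GlcNAc": -0.8,
    "Man(a1-3)Man": -0.9,
    "Gal(b1-3)GalNAc": -0.7,
}
motifs = ["Neu5Ac(a2-6)", "Neu5Ac(a2-3)", "Gal(b1-4)GlcNAc"]

glycans = list(binding)
values = [binding[g] for g in glycans]
raw = {m: permutation_pvalue(values, [m in g for g in glycans]) for m in motifs}
corrected = dict(zip(motifs, benjamini_hochberg([raw[m] for m in motifs])))

for m in sorted(motifs, key=raw.get):
    print(f"{m}\tp={raw[m]:.3f}\tcorrected={corrected[m]:.3f}")
```

A substring check cannot see motifs that span a branch point, which is one reason graph-based matching is preferable for real annotation. With only eight glycans the permutation test also has limited resolution, so the numbers are meaningful only as a demonstration of the workflow.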

Conclusion

As the most diverse biological sequence, glycans require dedicated analyses. Until now, technical limitations have directed the focus of researchers to rather short and/or uniform glycans that are amenable to manual analysis, such as N-glycans or short O-glycans. Yet with the addition of more (complex) sequences (Malaker et al. 2021), and the combination of glycans with systems biology data (Kearney et al. 2021), manual analysis is becoming increasingly unrealistic. We envision that accessible analysis platforms such as glycowork will allow researchers to connect knowledge from different glycobiology areas to fuel discoveries and extend the scope of already known phenomena. We are planning to improve glycowork in future work by expanding its functionalities. This includes implementations of more existing glycobioinformatics techniques, such as the Multiple Carbohydrate Alignment with Weights tool (Hosoda et al. 2017). We will also update the stored datasets in glycowork as new glycans become available, to maximize the utility of sequence context analysis, database queries and other functions. We encourage interested readers to find more details and examples in the documentation of glycowork (https://bojarlab.github.io/glycowork/). We would also like to invite the community to suggest, or even implement, changes, improvements or additions, to maximize the utility of glycowork for glycobioinformatics and allow researchers to include glycan data analysis in their routine workflows.

Authors’ contributions

Conceptualization: D.B., Data Curation: L.T., R.B., D.B., Funding Acquisition: D.B., Investigation: L.T., D.B., Resources: D.B., Software: L.T., R.B., D.B., Supervision: D.B., Visualization: L.T., D.B., Writing—Original Draft Preparation: L.T., D.B., Writing—Review & Editing: L.T., R.B., D.B.

References (22 in total)

Review 1. Orchestrating high-throughput genomic analysis with Bioconductor.

Authors: Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal: Nat Methods Date: 2015-02 Impact factor: 28.547

2. The GlySpace Alliance: toward a collaborative global glycoinformatics community.

Authors: Kiyoko F Aoki-Kinoshita; Frederique Lisacek; Raja Mazumder; William S York; Nicolle H Packer
Journal: Glycobiology Date: 2020-01-28 Impact factor: 4.313

3. Deep-Learning Resources for Studying Glycan-Mediated Host-Microbe Interactions.

Authors: Daniel Bojar; Rani K Powers; Diogo M Camacho; James J Collins
Journal: Cell Host Microbe Date: 2020-10-28 Impact factor: 21.023

Review 4. Investigating virus-glycan interactions using glycan microarrays.

Authors: David F Smith; Richard D Cummings
Journal: Curr Opin Virol Date: 2014-07-01 Impact factor: 7.090

5. GlycoCT-a unifying sequence format for carbohydrates.

Authors: S Herget; R Ranzinger; K Maass; C-W V D Lieth
Journal: Carbohydr Res Date: 2008-03-13 Impact factor: 2.104

6. WURCS: the Web3 unique representation of carbohydrate structures.

Authors: Kenichi Tanaka; Kiyoko F Aoki-Kinoshita; Masaaki Kotera; Hiromichi Sawaki; Shinichiro Tsuchiya; Noriaki Fujita; Toshihide Shikanai; Masaki Kato; Shin Kawano; Issaku Yamada; Hisashi Narimatsu
Journal: J Chem Inf Model Date: 2014-06-04 Impact factor: 4.956

7. Global mapping of glycosylation pathways in human-derived cells.

Authors: Yi-Fan Huang; Kazuhiro Aoki; Sachiko Akase; Mayumi Ishihara; Yi-Shi Liu; Ganglong Yang; Yasuhiko Kizuka; Shuji Mizumoto; Michael Tiemeyer; Xiao-Dong Gao; Kiyoko F Aoki-Kinoshita; Morihisa Fujita
Journal: Dev Cell Date: 2021-03-16 Impact factor: 12.270

8. Development and application of an algorithm to compute weighted multiple glycan alignments.

Authors: Masae Hosoda; Yukie Akune; Kiyoko F Aoki-Kinoshita
Journal: Bioinformatics Date: 2017-05-01 Impact factor: 6.937

9. SUGAR-seq enables simultaneous detection of glycans, epitopes, and the transcriptome in single cells.

Authors: Conor J Kearney; Stephin J Vervoort; Kelly M Ramsbottom; Izabela Todorovski; Emily J Lelliott; Magnus Zethoven; Lizzy Pijpers; Ben P Martin; Timothy Semple; Luciano Martelotto; Joseph A Trapani; Ian A Parish; Nichollas E Scott; Jane Oliaro; Ricky W Johnstone
Journal: Sci Adv Date: 2021-02-19 Impact factor: 14.136

10. Identifying glycan motifs using a novel subtree mining approach.

Authors: Lachlan Coff; Jeffrey Chan; Paul A Ramsland; Andrew J Guy
Journal: BMC Bioinformatics Date: 2020-02-04 Impact factor: 3.169
