Lachlan Coff, Jeffrey Chan, Paul A Ramsland, Andrew J Guy.
Abstract
BACKGROUND: Glycans are complex sugar chains, crucial to many biological processes. By participating in binding interactions with proteins, glycans often play key roles in host-pathogen interactions. The specificities of glycan-binding proteins, such as lectins and antibodies, are governed by motifs within larger glycan structures, and improved characterisations of these determinants would aid research into human diseases. Identification of motifs has previously been approached as a frequent subtree mining problem, and we extend these approaches with a glycan notation that allows recognition of terminal motifs.
Keywords: Carbohydrate; Frequent subtree mining; Glycan; Glycobiology; Machine learning; Microarray; Motif
Year: 2020 PMID: 32019496 PMCID: PMC7001330 DOI: 10.1186/s12859-020-3374-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Workflow for identification of key binding motifs from glycan microarray data and construction of predictive classifier
Fig. 2 Addition of restricted linkage nodes improves selection of candidate motifs for glycan binding data. In this illustrative example, there is a single glycan (Gal β1-3GalNAc) capable of binding to a candidate lectin (e.g. PNA), while sialylation of the galactose residue (Neu5Ac α2-3Gal β1-3GalNAc and Neu5Ac α2-6Gal β1-3GalNAc) restricts binding. Generation of subtrees from these three glycans yields a set of potential motifs that could be used to discriminate between binders and non-binders. Note that one of these subtrees contains a 'restricted linkage' node, to indicate the absence of a connection at positions 3 and 6 on the terminal galactose; there are connections at these positions within the non-binding set. This restricted linkage node is indicated by an X. Without consideration of restricted linkage nodes, there are no subtrees that are unique to the binding set. However, with addition of restricted linkage nodes, there is a single subtree from the binding set that adequately discriminates between binding and non-binding glycans. This candidate motif is marked with an asterisk. All glycan motif structures are shown in SNFG [51], modified with restricted linkages. Each restricted linkage, with corresponding carbon numbers, terminates in a cross in place of a residue symbol, according to the key
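The restricted linkage idea in Fig. 2 can be illustrated with a small sketch. This is not the authors' implementation: the nested-tuple glycan encoding, the rule that a carbon is "restrictable" if it is occupied on the same residue type in any glycan of the set, and all function names are assumptions made for this example. Anomeric configuration is ignored for brevity.

```python
RESTRICTED = "X"  # stand-in residue meaning "no connection at this carbon"

# Hypothetical encoding: a glycan is (residue, {carbon_on_this_residue: child}),
# rooted at the reducing end.
T_ANTIGEN = ("GalNAc", {3: ("Gal", {})})                    # binder
SIALYL_2_3 = ("GalNAc", {3: ("Gal", {3: ("Neu5Ac", {})})})  # non-binder
SIALYL_2_6 = ("GalNAc", {3: ("Gal", {6: ("Neu5Ac", {})})})  # non-binder


def occupied_carbons(glycans):
    """Per residue type, collect every carbon that carries a child in any glycan."""
    seen = {}

    def walk(node):
        residue, children = node
        seen.setdefault(residue, set()).update(children)
        for child in children.values():
            walk(child)

    for glycan in glycans:
        walk(glycan)
    return seen


def add_restricted(node, seen):
    """Return a copy of the glycan with an X child on each unoccupied candidate carbon."""
    residue, children = node
    out = {carbon: add_restricted(child, seen) for carbon, child in children.items()}
    for carbon in seen.get(residue, ()):
        out.setdefault(carbon, (RESTRICTED, {}))
    return (residue, out)


def contains(glycan, motif):
    """True if the motif matches as a subtree rooted at some residue of the glycan."""
    def matches(g, m):
        return g[0] == m[0] and all(
            carbon in g[1] and matches(g[1][carbon], child)
            for carbon, child in m[1].items()
        )

    return matches(glycan, motif) or any(contains(c, motif) for c in glycan[1].values())


glycans = [T_ANTIGEN, SIALYL_2_3, SIALYL_2_6]
seen = occupied_carbons(glycans)
annotated = [add_restricted(g, seen) for g in glycans]

plain_motif = ("Gal", {})
terminal_motif = ("Gal", {3: (RESTRICTED, {}), 6: (RESTRICTED, {})})

plain_hits = [contains(g, plain_motif) for g in annotated]
terminal_hits = [contains(g, terminal_motif) for g in annotated]
```

In this toy set the plain `Gal` motif occurs in all three glycans, so it cannot separate binders from non-binders, whereas the motif with restricted carbons 3 and 6 occurs only in the unsialylated glycan.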
Table 1 Classification performance and identified motifs for common lectins
| Lectin | Conc. | AUC (Validation) | AUC (Train) | Top Motif* |
|---|---|---|---|---|
| | 100 | 0.934 (0.034) | 0.947 (0.006) | (*3,4,6)GlcNAc |
| Concanavalin A (Con A) | 10 | 0.971 (0.031) | 0.982 (0.015) | Man |
| | 100 | 0.839 (0.069) | 0.897 (0.042) | (*3,4,6)GalNAc |
| Human DC-SIGN tetramer | 200 | 0.841 (0.062) | 0.955 (0.026) | Man |
| | 10 | 0.867 (0.061) | 0.953 (0.014) | (*2,3,4,6)Gal |
| Influenza hemagglutinin (HA) (A/Puerto Rico/8/34) (H1N1) | 200 | 0.913 (0.105) | 0.973 (0.023) | (*8,9)Neu5Ac |
| Influenza HA (A/harbor seal/Massachusetts/1/2011) (H3N8) | 200 | 0.959 (0.028) | 0.958 (0.007) | (*8,9)Neu5Ac |
| Jacalin | 1 | 0.882 (0.055) | 0.896 (0.009) | (*4,6)GalNAc |
| | 10 | 0.964 (0.032) | 0.976 (0.008) | Man |
| | 10 | 0.833 (0.035) | 0.848 (0.053) | (*2,4,6)Gal |
| | 10 | 0.718 (0.078) | 0.814 (0.074) | Gal |
| | 10 | 0.959 (0.018) | 0.975 (0.009) | (*2,4,6)Gal |
| | 10 | 0.914 (0.126) | 0.967 (0.030) | GlcNAc |
| Peanut agglutinin (PNA) | 10 | 0.914 (0.048) | 0.943 (0.021) | (*2,3,4,6)Gal |
| | 10 | 0.890 (0.053) | 0.929 (0.028) | Man |
| | 10 | 0.953 (0.026) | 0.958 (0.008) | (*2,3,4,6)Gal |
| Soybean agglutinin (SBA) | 10 | 0.875 (0.061) | 0.938 (0.026) | (*3,4,6)GalNAc |
| | 10 | 0.950 (0.060) | 0.979 (0.010) | Neu5Ac |
| | 100 | 0.861 (0.049) | 0.895 (0.042) | (*3)Fuc |
| Wheat germ agglutinin (WGA) | 1 | 0.882 (0.021) | 0.901 (0.004) | GlcNAc |
Model performance was assessed using stratified 5-fold cross-validation, with Area Under the Curve (AUC) values calculated for both validation and training folds (shown as mean (s.d.)). The top motif is defined as the feature with the highest coefficient in the logistic regression classification model, and is shown for a single test/training split. Experimentally determined lectin specificities and associated citations are provided in Additional file 7
*Note: Motifs are written in a modified CFG linear text nomenclature. A set of parentheses with connection types preceded by an asterisk indicates restricted connection types for the following residue. For example, a GlcNAc motif with restricted connections on C3 and C4 is indicated by (*3,4)GlcNAc
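The notation described in the note can be read mechanically. The following is a minimal sketch, assuming only the single-residue form that appears in the Top Motif column (an optional `(*…)` prefix followed by a residue name); the regular expression and the function name `parse_terminal_motif` are hypothetical and do not come from the paper or from any CFG tooling.

```python
import re

# Optional "(*c1,c2,...)" prefix listing restricted carbons, then a residue name.
MOTIF_RE = re.compile(r"^(?:\(\*(?P<carbons>\d+(?:,\d+)*)\))?(?P<residue>[A-Za-z0-9]+)$")


def parse_terminal_motif(text):
    """Split a single-residue motif string into (residue, restricted carbons)."""
    match = MOTIF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised motif string: {text!r}")
    carbons = match.group("carbons")
    restricted = tuple(int(c) for c in carbons.split(",")) if carbons else ()
    return match.group("residue"), restricted


example = parse_terminal_motif("(*3,4)GlcNAc")   # ('GlcNAc', (3, 4))
unrestricted = parse_terminal_motif("Man")       # ('Man', ())
```

Multi-residue motifs with linkages would need a fuller parser; this sketch deliberately rejects anything beyond the single-residue form.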
Fig. 3 Predicted carbohydrate-binding motifs of PNA from CFG glycan microarray data. (a) Distribution of RFUs and classification of non-binding (blue), intermediate binding (orange), and binding glycans (red). (b) ROC curves for the test (n=143) and training (n=428) sets. The ratio of negative to positive samples was 9.0. (c) Logistic regression coefficients for identified motifs. (d) The intermolecular hydrogen bonding interactions (shown in green) between the T antigen (carbon backbone shown in yellow) and the carbohydrate-binding domain of peanut agglutinin (PNA) (carbon backbones shown in grey). Carbon 3 of the Gal monomer is labelled to indicate where the sialic acid is linked in the sialyl T antigen. Reproduced from an X-ray crystal structure at 2.5 Å resolution available at the PDB (PDB: 2TEP) [28]. See Additional file 1 for a detailed notation key
Fig. 4 Predicted carbohydrate-binding motifs of Con A from CFG glycan microarray data. (a) Distribution of RFUs and classification of non-binding (blue), intermediate binding (orange), and binding glycans (red). (b) ROC curves for the test (n=141) and training (n=421) sets. The ratio of negative to positive samples was 4.1. (c) Logistic regression coefficients for identified motifs. (d) The intermolecular hydrogen bonding interactions (shown in green) between 2 α-mannobiose (carbon backbone shown in yellow) and the carbohydrate-binding domain of Concanavalin A (carbon backbones shown in grey). Reproduced from an X-ray crystal structure at 1.2 Å resolution available at the Protein Data Bank (PDB: 1I3H) [52]. See Additional file 1 for a detailed notation key
Fig. 5 Predicted carbohydrate-binding motifs of RCA I from CFG glycan microarray data. (a) Distribution of RFUs and classification of non-binding (blue), intermediate binding (orange), and binding glycans (red). (b) ROC curves for the test (n=125) and training (n=372) sets. The ratio of negative to positive samples was 4.4. (c) Logistic regression coefficients for identified motifs. (d) The intermolecular hydrogen bonding interactions (shown in green) between β-galactose (carbon backbone shown in yellow) and the carbohydrate-binding domain of the B chain of ricin (carbon backbones shown in grey). Reproduced from an X-ray crystal structure at 2.5 Å resolution available at the PDB (PDB: 3RTI) [39]. See Additional file 1 for a detailed notation key
Fig. 6 Predicted carbohydrate-binding motifs of two haemagglutinins from a human and an avian strain of influenza from CFG glycan microarray data. (a) Distribution of RFUs and classification of non-binding (blue), intermediate binding (orange), and binding glycans (red) for A/Puerto Rico/8/34 (H1N1) HA. (b) ROC curves for the test (n=138) and training (n=412) sets for A/Puerto Rico/8/34 (H1N1) HA. The ratio of negative to positive samples was 26.5. (c) Logistic regression coefficients for identified motifs for A/Puerto Rico/8/34 (H1N1) HA. (d) Distribution of RFUs and classification of non-binding (blue), intermediate binding (orange), and binding glycans (red) for A/harbor seal/Massachusetts/1/2011 (H3N8) HA. (e) ROC curves for the test (n=145) and training (n=433) sets for A/harbor seal/Massachusetts/1/2011 (H3N8) HA. The ratio of negative to positive samples was 11.4. (f) Logistic regression coefficients for identified motifs for A/harbor seal/Massachusetts/1/2011 (H3N8) HA. See Additional file 1 for a detailed notation key
Fig. 7 Classification performance across a range of different lectins. (a) Receiver-operator characteristic (ROC) curves across a number of different glycan microarray experiments. Individual ROC curves are shown in light blue. The median ROC curve is shown in black, with shading representing 25th-75th percentiles. The dashed line indicates an uninformative (random) classifier. (b) Area Under the Curve (AUC) values for all glycan microarray experiments examined. See Table 1 and Additional file 5 for a full list of lectins examined. (c) Classification performance of CCARL compared to existing glycan motif tools. Area Under the Curve (AUC) values were calculated across a number of different glycan microarray experiments using stratified 5-fold cross-validation (with the exception of MotifFinder, which was evaluated using a single fold). Motifs were extracted using GLYMMR, MotifFinder, the Glycan Miner Tool and CCARL, and assessed using a logistic regression model (with the exception of MotifFinder, which outputs predicted RFU values). Motifs from GLYMMR were extracted at several minimum support thresholds, and both the mean AUC value and best AUC value reported for each microarray experiment. Median and interquartile range are indicated by solid and dashed grey lines respectively
Table 2 Comparison of classifier performance across different motif generation tools
| Lectin | GLYMMR (mean) | GLYMMR (best) | Glycan Miner Tool | MotifFinder | CCARL |
|---|---|---|---|---|---|
| | 0.607 (0.151) | 0.776 (0.088) | 0.888 (0.067) | 0.905 | |
| Concanavalin A (Con A) | 0.760 (0.083) | 0.875 (0.048) | 0.951 (0.042) | 0.937 | |
| | 0.630 (0.098) | 0.674 (0.126) | 0.722 (0.083) | | 0.839 (0.069) |
| Human DC-SIGN tetramer | 0.634 (0.132) | 0.727 (0.125) | 0.823 (0.130) | 0.538 | |
| | 0.773 (0.103) | 0.847 (0.086) | | | 0.867 (0.061) |
| Influenza hemagglutinin (HA) (A/Puerto Rico/8/34) (H1N1) | 0.851 (0.140) | 0.889 (0.103) | 0.838 (0.144) | 0.643 | |
| Influenza HA (A/harbor seal/Massachusetts/1/2011) (H3N8) | 0.925 (0.059) | 0.935 (0.034) | 0.947 (0.021) | 0.717 | |
| Jacalin | 0.782 (0.061) | 0.804 (0.050) | 0.848 (0.026) | 0.726 | |
| | 0.772 (0.092) | 0.811 (0.083) | 0.908 (0.083) | 0.832 | |
| | 0.700 (0.054) | 0.758 (0.057) | 0.868 (0.050) | | 0.833 (0.035) |
| | 0.600 (0.162) | 0.827 (0.056) | | 0.830 | 0.721 (0.073) |
| | 0.817 (0.061) | 0.875 (0.044) | 0.910 (0.016) | 0.496 | |
| | 0.805 (0.095) | 0.829 (0.089) | 0.858 (0.110) | 0.636 | |
| Peanut agglutinin (PNA) | 0.668 (0.116) | 0.751 (0.133) | 0.894 (0.041) | 0.617 | |
| | 0.796 (0.070) | 0.830 (0.050) | 0.858 (0.064) | 0.694 | |
| | 0.696 (0.053) | 0.751 (0.032) | 0.848 (0.034) | 0.909 | |
| Soybean agglutinin (SBA) | 0.542 (0.061) | 0.582 (0.049) | 0.781 (0.046) | 0.775 | |
| | 0.962 (0.051) | 0.962 (0.050) | | 0.820 | 0.961 (0.059) |
| | 0.703 (0.099) | 0.734 (0.057) | 0.866 (0.023) | | 0.859 (0.047) |
| Wheat germ agglutinin (WGA) | 0.663 (0.048) | 0.697 (0.055) | 0.831 (0.034) | 0.817 | |
Model performance was assessed using stratified 5-fold cross-validation, with mean Area Under the Curve (AUC) values calculated across all validation folds (shown as mean (s.d.)). The best performing tool for each sample is highlighted in bold. Note the MotifFinder tool was evaluated with a single test-train split due to difficulty automating this tool. GLYMMR was evaluated across a range of minimum support thresholds, with AUC values reported for the best threshold as well as mean AUC values across all thresholds
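The evaluation protocol described in the table notes (stratified 5-fold cross-validation, a logistic regression model on motif features, AUC reported as mean (s.d.) over folds) can be sketched with the Python standard library alone. This is an illustrative sketch, not the CCARL code: the fold assignment in `stratified_folds`, the plain gradient-descent `fit_logistic`, its learning rate and epoch count, and the toy binary motif-presence matrix are all assumptions made for the example.

```python
import math
import random
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps a similar class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Batch gradient-descent logistic regression; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def cross_validate(X, y, k=5):
    """Return ((mean, sd) validation AUC, (mean, sd) training AUC) over k folds."""
    val_aucs, train_aucs = [], []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])

        def score(i):
            return b + sum(wj * xj for wj, xj in zip(w, X[i]))

        val_aucs.append(auc([y[i] for i in held_out], [score(i) for i in held_out]))
        train_aucs.append(auc([y[i] for i in train], [score(i) for i in train]))
    return (mean(val_aucs), stdev(val_aucs)), (mean(train_aucs), stdev(train_aucs))


# Toy data (hypothetical): column 0 marks a motif found only in binders,
# column 1 marks a motif found only in some non-binders.
X = [[1, 0]] * 10 + [[0, 1]] * 15 + [[0, 0]] * 15
y = [1] * 10 + [0] * 30
(val_mean, val_sd), (train_mean, train_sd) = cross_validate(X, y)
```

Because the toy features separate the classes perfectly, every fold reaches an AUC of 1.0; real microarray data would give a spread across folds, which is what the s.d. in the tables summarises.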