Literature DB >> 32576591

Open Database Searching Enables the Identification and Comparison of Bacterial Glycoproteomes without Defining Glycan Compositions Prior to Searching.

Ameera Raudah Ahmad Izaham¹, Nichollas E Scott².

Abstract

Mass spectrometry has become an indispensable tool for the characterization of glycosylation across biological systems. Our ability to generate rich fragmentation of glycopeptides has dramatically improved over the last decade yet our informatic approaches still lag behind. Although glycoproteomic informatics approaches using glycan databases have attracted considerable attention, database independent approaches have not. This has significantly limited high throughput studies of unusual or atypical glycosylation events such as those observed in bacteria. As such, computational approaches to examine bacterial glycosylation and identify chemically diverse glycans are desperately needed. Here we describe the use of wide-tolerance (up to 2000 Da) open searching as a means to rapidly examine bacterial glycoproteomes. We benchmarked this approach using N-linked glycopeptides of Campylobacter fetus subsp. fetus as well as O-linked glycopeptides of Acinetobacter baumannii and Burkholderia cenocepacia revealing glycopeptides modified with a range of glycans can be readily identified without defining the glycan masses before database searching. Using this approach, we demonstrate how wide tolerance searching can be used to compare glycan use across bacterial species by examining the glycoproteomes of eight Burkholderia species (B. pseudomallei; B. multivorans; B. dolosa; B. humptydooensis; B. ubonensis, B. anthina; B. diffusa; B. pseudomultivorans). Finally, we demonstrate how open searching enables the identification of low frequency glycoforms based on shared modified peptides sequences. Combined, these results show that open searching is a robust computational approach for the determination of glycan diversity within bacterial proteomes.

Entities: CellLine Chemical Disease Mutation Species

Keywords: Glycoproteomics; N-glycosylation; bacteria; bioinformatics searching; glycomics; glycoprotein structure; glycoproteins; glycosylation; open searching; post-translational modifications

Year: 2020 PMID： 32576591 PMCID： PMC8143609 DOI： 10.1074/mcp.TIR120.002100

Source DB: PubMed Journal: Mol Cell Proteomics ISSN： 1535-9476 Impact factor: 5.911

Protein glycosylation, the addition of carbohydrates to proteins, is a widespread and heterogeneous class of protein modifications (1, 2, 3). Within Eukaryotes, multiple glycosylation systems have been identified (1, 2, 3) and up to 20% of the proteome is thought to be subjected to this class of modifications (4). Within Eukaryotes, both N-linked and O-linked glycosylation systems are known to generate highly heterogeneous glycan structures (2, 3) with this glycan heterogeneity important for the function of glycoproteins (5, 6). Although the glycan repertoire used in eukaryotic systems is thought to be large, the diversity within any given biological sample is constrained by the limited number of monosaccharides used in eukaryotic systems (7), as well as the expression of proteins required for the construction of glycans such as glycosyltransferases (8). Experimentally, these constraints lead to only a limited number of glycans being produced across eukaryotic samples (9, 10) despite the large number of potential glycan structures (11, 12). This limited diversity within both eukaryotic N-linked and O-linked glycans has enabled the development of glycan databases which have facilitated high throughput glycoproteomic studies (13) using tools such as Byonic (14) and pGlyco (15). Unfortunately, these databases are not suitable for all glycosylation systems and fail to identify glycopeptides modified with novel or atypical glycans such as those found in bacterial glycosylation systems. Within bacterial systems, glycosylation is increasingly recognized as a common modification (16, 17, 18, 19). Although glycosylation in bacteria was first identified in the 1970s (20), it is only within the last two decades that it has become clear that this class of modifications is ubiquitous across bacterial genera (16, 18, 21). Unlike eukaryotic systems, which use a relatively small set of monosaccharides, bacterial glycoproteins are decorated with a diverse range of monosaccharides (22) leading to a staggering array of glycan structures (23, 24, 25, 26, 27, 28, 29, 30, 31, 32). This glycan diversity represents a significant challenge to the field as it makes the identification of novel bacterial glycoproteins a nontrivial analytical undertaking. Yet, through advancements in mass spectrometry (MS) (28, 30, 33, 34), these once obscure modifications are increasingly recognizable and are now known to be essential for bacterial fitness (26, 35, 36, 37, 38). Despite our ability to generate rich bacterial glycopeptide data the field still largely uses manual interrogation to identify and characterize novel glycosylation systems (23, 24, 25, 26, 27, 28, 29, 30, 31, 32). This dependence on manual interrogation is not scalable, time-consuming and prone to human error, especially in the detection of glycoform heterogeneity. This is exemplified in our own experience characterizing glycosylation in Acinetobacter baumannii where our initial analysis overlooked alternative methylated and deacetylated forms of glucuronic acid (26). Thus, new approaches are needed to ensure bacterial glycosylation studies can be undertaken in a robust and high-throughput manner. Wide precursor mass tolerance database searching, also known as “open” or wildcard searching, is an increasingly popular approach for the detection of protein modifications within proteomic data sets (39, 40, 41, 42, 43). The underlying premise of this approach is that by allowing a wide precursor mass tolerance, modified peptides can be detected by the difference in their observed mass from their expected mass. Importantly, this makes the identification of modifications independent of needing to define the modification in the initial search parameters. This approach has been used to examine chemical modifications such as formylation (44) and mis-alkylation (45) as well as large modifications such as DNA-peptide cross-links (43) and eukaryotic glycosylation (46, 47). Although this approach is effective, it is not without tradeoffs being computationally more expensive than traditional searches leading to longer search times (48). To date, these searches have typically been undertaken using ±500 Da tolerances (39, 40, 41, 42, 43) yet large delta mass windows of ±1000Da (43, 46) and even +3000Da (46, 47) have been reported. Despite the growing application of open database searching in eukaryotic proteomics, few bacterial studies have used this technique. This said, alternative strategies such as dependent peptide searching have been used in bacteria to track misincorporation of amino acids (49) and identify novel forms of glycosylation such as arginine-rhammnosylation (50). In this study, we demonstrate that wide mass (up to 2000 Da) open database searching enables the rapid identification of bacterial glycopeptides without the need to assign glycan masses before database searching. Using Byonic™, which enables both glycopeptide and open database searching (14, 51), we benchmark wide mass open searching on three previously characterized bacterial glycosylation systems: the N-linked glycosylation system of Campylobacter fetus subsp. fetus NCTC10842 (25); the O-linked glycosylation system of Acinetobacter baumannii ATCC17978 (26, 52, 53); and the O-linked glycosylation system of Burkholderia cenocepacia J2315 (23, 37). Each of these bacteria have increasingly complex proteomes (ranging from 1600 to nearly 7000 proteins) enabling us to assess the performance of open database searching across a range of proteome sizes. We find open database searching readily enabled previously characterized glycoforms and microheterogeneity to be identified across all samples. Applying this approach to representative species of the Burkholderia genus (23, 37), we provide the first snapshot of glycosylation across this genus. Consistent with the conservation of the biosynthetic pathway responsible for the Burkholderia O-linked glycans (23) all Burkholderia species examined predominately modify their glycoproteins with two glycan structures of similar composition. Excitingly, we demonstrate that open searching also enables low frequency glycoforms to be detected, highlighting that species-specific glycan structures do exist in Burkholderia. Thus, open database searching enables the identification of diverse bacterial glycan structures in a high-throughput manner.

Bacterial Strains and Growth Conditions

C. fetus subsp. fetus NCTC 10842 was grown on Brain-Heart Infusion medium (Hardy Diagnostics) with 5% defibrinated horse blood (Hemostat, Dixon, CA) under microaerobic conditions (10% CO2, 5% O2, 85% N2) at 37 °C as previously reported (25). Burkholderia pseudomallei K96243 was grown as previously reported (54) in Luria Bertani (LB) broth. All other bacterial strains were grown overnight on LB agar at 37 °C as previously described (37). Details on the strains, their origins, references and proteome databases used in this study are provided within Table I.

Table I

Strain list

Strains	Source (description, country, year)	Reference	Proteome database
Campylobacter fetus subsp. fetus NCTC 10842	Brain of sheep fetus, France, 1952	(79)	Uniprot Database: UP000001035
Acinetobacter baumannii ATCC17978	Fatal meningitis of a 4-month old infant, 1951	(80)	GenBank assembly accession: GCA_001593425.2
Burkholderia pseudomallei K96243	Human clinical specimen, Thailand, 1996	(81)	Uniprot Database: UP000000605
Burkholderia cenocepacia (LMG 16656/J2315)	Human clinical specimen, United Kingdom, 1989	(82)	Uniprot Database: UP000001035
Burkholderia multivorans MSMB2008	Soil isolate, Australia, 2012	(83)	Burkholderia Genome Database (84), Strain number: 3016
Burkholderia dolosa AU0158	Human clinical specimen, USA unknown	(85)	Uniprot database: UP000032886
Burkholderia humptydooensis MSMB43	Water isolate, Australia, unknown	(83, 86)	Burkholderia Genome Database (84), Strain number: 4072
Burkholderia ubonensis MSMB22	Soil isolate, Australia, 2001	(85)	Burkholderia Genome Database (84), Strain number: 3404
Burkholderia anthina MSMB649	Soil isolate, Australia, 2010	(83)	Burkholderia Genome Database (84), Strain number: 2849
Burkholderia diffusa MSMB375	Water isolate, Australia, 2008	(83)	Burkholderia Genome Database (84), Strain number: 2966
Burkholderia pseudomultivorans MSMB2199	Soil isolate, Australia, 2011	(83)	Burkholderia Genome Database (84), Strain number: 3251

Strain list

Generation of Bacterial Lysates for Glycoproteome Analysis

Bacterial strains were grown to confluency on agar plates before being flooded with 5 ml of pre-chilled sterile PBS (PBS) and bacterial cells collected by scraping. Cells were washed 3 times in PBS to remove media contaminates, then collected by centrifugation at 10,000 × g at 4˚C for 10 min and then snap frozen. Snap frozen cell samples were resuspended in 4% SDS, 100 mm Tris pH 8.0, 20 mm Dithiothreitol (DTT) and boiled at 95˚C with shaking at 2000 rpm for 10 min. Samples were clarified by centrifugation at 17,000 × g for 10 min, the supernatants then collected, and protein concentration determined by a bicinchoninic acid assay (Thermo Fisher Scientific, Waltham, MA). 1 mg of protein from each sample was acetone precipitated by mixing one volume of sample with 4 volumes of ice-cold acetone. Samples were precipitated overnight at -20˚C and then spun down at 16,000 × g for 10 min at 0˚C. The precipitated protein pellets were resuspended in 80% ice-cold acetone and precipitated for an additional 4 h at −20˚C. Samples were centrifuged at 17,000 × g for 10 min at 0˚C, the supernatant discarded, and excess acetone driven off at 65˚C for 5 min Three biological replicates of each bacterial strain were prepared.

Digestion of Protein Samples

Protein digestion was undertaken as previously described with minor alterations (28). Briefly, dried protein pellets were resuspended in 6 M urea, 2 M thiourea in 40 mm NH4HCO3 then reduced for 1 h with 20 mm DTT followed by being alkylated with 40 mm chloroacetamide for 1 h in the dark. Samples were then digested with Lys-C (1/200 w/w) for 3 h before being diluted with 5 volumes of 40 mm NH4HCO3 and digested overnight with trypsin (1/50 w/w). Digested samples were acidified to a final concentration of 0.5% formic acid and desalted with 50 mg tC18 Sep-Pak columns (Waters corporation, Milford, MA) according to the manufacturer's instructions. tC18 Sep-Pak columns were conditioned with 10 bed volumes of Buffer B (0.1% formic acid, 80% acetonitrile), then equilibrated with 10 bed volumes of Buffer A* (0.1% TFA, 2% acetonitrile) before use. Samples were loaded on to equilibrated columns then columns washed with at least 10 bed volumes of Buffer A* before bound peptides were eluted with Buffer B. Eluted peptides were dried by vacuum centrifugation and stored at −20˚C.

ZIC-HILIC Enrichment of Bacterial Glycopeptides

ZIC-HILIC enrichment was performed as previously described with minor modifications (28). ZIC-HILIC Stage-tips (55) were created by packing 0.5 cm of 10 μm ZIC-HILIC resin (Millipore, Burlington, MA) into p200 tips containing a frit of C8 Empore™ (Sigma, St. Louis, MO) material. Before use, the columns were washed with ultrapure water, followed by 95% acetonitrile and then equilibrated with 80% acetonitrile and 5% formic acid. Digested proteome samples were resuspended in 80% acetonitrile and 5% formic acid. Samples were adjusted to a concentration of 3 µg/µL (a total of 300 µg of peptide used for each enrichment) then loaded onto equilibrated ZIC-HILIC columns. ZIC-HILIC columns were washed with 20 bed volumes of 80% acetonitrile, 5% formic acid to remove nonglycosylated peptides and bound peptides eluted with 10 bed volumes of ultrapure water. Eluted peptides were dried by vacuum centrifugation and stored at −20˚C.

Reverse Phase LC–MS

ZIC-HILIC enriched samples were re-suspended in Buffer A* and separated using a two-column chromatography set up composed of a PepMap100 C18 20 mm × 75 μm trap and a PepMap C18 500 mm × 75 μm analytical column (Thermo Fisher Scientific). Samples were concentrated onto the trap column at 5 μL/min for 5 min with Buffer A (0.1% formic acid, 2% DMSO) and then infused into an Orbitrap Fusion™ Lumos™ Tribrid™ Mass Spectrometer (Thermo Fisher Scientific) at 300 nl/minute via the analytical column using a Dionex Ultimate 3000 UPLC (Thermo Fisher Scientific). 185-min analytical runs were undertaken by altering the buffer composition from 2% Buffer B (0.1% formic acid, 77.9% acetonitrile, 2% DMSO) to 28% B over 150 min, then from 28% B to 40% B over 10 min, then from 40% B to 100% B over 2 min The composition was held at 100% B for 3 min, and then dropped to 2% B over 5 min before being held at 2% B for another 15 min The Lumos™ Mass Spectrometer was operated in a data-dependent mode automatically switching between the acquisition of a single Orbitrap MS scan (120,000 resolution) every 3 s and Orbitrap MS/MS HCD scans of precursors (NCE 30%, maximal injection time of 80 ms, AGC 1*105 with a resolution of 15,000). HexNAc oxonium ion (204.087 m/z) product-dependent MS/MS analysis (56) was used to trigger three additional scans of potential glycopeptides; a Orbitrap EThcD scan (NCE 15%, maximal injection time of 250 ms, AGC 2*105 with a resolution of 30,000); a ion trap CID scan (NCE 35%, maximal injection time of 40 ms, AGC 5*104) and a stepped collision energy HCD scan (using NCE 32%, 40%, 48% for N-linked glycopeptide samples and NCE 28%, 38%, 48% for O-linked glycopeptide samples with a maximal injection time of 250 ms, AGC 2*105 with a resolution of 30,000). For B. pseudomallei K96243 glycopeptide enrichments, duplicate runs were undertaken as above with the Orbitrap EThcD scans modified to use the extended mass range setting (200 m/z to 3000 m/z) to improve the detection of high mass glycopeptide fragment ions (61).

Data Analysis

Raw data files were batch processed using Byonic v3.5.3 (Protein Metrics Inc (14)) with the proteome databases denoted within Table I. Data were searched on a desktop with two 3.00 GHz Intel Xeon Gold 6148 processors, a 2TB SDD and 128 GB of RAM using a maximum of 16 cores for a given search. For all searches, a semi-tryptic N-ragged specificity was set and a maximum of two missed cleavage events allowed. Carbamidomethyl was set as a fixed modification of cystine whereas oxidation of methionine was included as a variable modification. A maximum mass precursor tolerance of 5 ppm was allowed whereas a mass tolerance of up to 10 ppm was set for HCD fragments and 20 ppm for EThcD fragments. For open searches of C. fetus fetus samples (N-linked glycosylation), the wildcard parameter was enabled allowing a delta mass between 200 Da and 1600 Da on asparagine residues. For open searches of O-linked glycosylation samples, the wildcard parameter was enabled allowing a delta mass between 200 Da and 2000 Da on serine and threonine residues. For focused searches, all parameters listed above remained constant except wildcard searching which was disabled and specific glycoforms as identified from open searches included as variable modifications. To ensure high data quality, separate data sets from the same biological samples were combined using R and only glycopeptides with a Byonic score >300 were used for further analysis. This score cutoff is in line with previous reports highlighting that score thresholds greater than at least 150 are required for robust glycopeptide assignments with Byonic (44, 57). It should be noted that a score threshold of above 300 resulted in false discovery rates of less than 1% for all combined data sets. Pearson correlation analysis of delta mass profiles was undertaken using Perseus (58). Data visualization was undertaken using ggplot2 within R with all scripts included in the PRIDE uploaded data sets.

Experimental Design and Statistical Rationale

For each bacterial strain examined three biological replicates were prepared and used for glycopeptide enrichments leading to three LC–MS runs per bacterial strain. Three separate enrichments were prepared and run to ensure an accurate representation of the observable glycoproteome. B. pseudomallei K96243 biological replicates were run twice with two different instrument methods to improve the characterization of the 990 Da glycan. For C. fetus fetus NCTC 10842 unenriched peptide samples were run with identical methods as those used for glycopeptide analysis to assess the presence of formylated glycans before enrichment.

Open Database Searching Allows the Identification of Bacterial N-Linked Glycopeptides

Although open database searching enables the detection of a variety of modifications including eukaryotic glycosylation (46, 47), to our knowledge, it has not been applied to bacterial systems or the study of atypical forms of glycosylation. To enable the identification of glycopeptides with complex glycans, large delta mass windows are needed as even modest glycans (>three monosaccharides) would be larger than the 500 Da window typically used for open searching (39, 40, 41, 42, 43). To enable wide mass searching we used Byonic, which uses a peak look up approach for the identification of PSMs (14, 51). This approach is both rapid and overcomes the combinatorial explosion which leads to long search times with database open searching (48, 51). To assess the ability of Byonic based open searching to identify bacterial glycopeptides, we first examined glycopeptide enrichments of C. fetus fetus NCTC 10842. C. fetus fetus possesses a small proteome (∼1600 proteins (61)) and is known to produce two N-linked glycans composed of β-GlcNAc-α1,3-[GlcNAc1,6-]GlcNAc-α1,4-GalNAc-α1,4-GalNAc-α1,3-diNAcBac (1243.507 Da) and β-GlcNAc-α1,3-[Glc1,6-]GlcNAc-α1,4-GalNAc-α1,4-GalNAc-α1,4-diNAcBac (1202.481 Da) where diNAcBac is the bacterial specific sugar 2,4-diacetamido-2,4,6 trideoxyglucopyranose (25). We searched ZIC-HILIC glycopeptide enrichments of C. fetus fetus allowing a wildcard mass of 200 Da to 1600 Da on asparagine enabling the processing of 3 h LC–MS runs by open searching in less than 2 h (supplemental Fig. S1A). Examination of the detected modifications by binning the observed delta masses in 0.001 Da increments demonstrated a clear cluster of modifications with masses >1000Da (Fig. 1A). Within these masses 1242.501 Da and 1201.475 Da were the most numerous delta masses observed (Fig. 1A, supplemental Table S1) yet these are one Da off the expected glycoforms of C. fetus fetus (25). Close examination of the observed delta masses reveals evidence of mis-assignments of the mono-isotopic masses by the appearance of satellite peaks (42, 48) differing by exactly one Da (supplemental Fig. S2A). The mis-assignment of the mono-isotopic peak of large glycopeptides has been previously noted (62) and within Byonic is combated by allowing isotope re-assignment denoted as the “off-by-x” parameter. Examination of the “off-by-x” masses supports the inappropriate mass correction of the 1243/1202 Da glycans to the observed 1242/1201 delta masses (supplemental Fig. S2B). These results support that the 1243/1202 Da glycans are readily detected in C. fetus fetus samples using open database searching despite the splitting of the delta mass observations across multiple masses because of errors in mono-isotope assignments.

Fig. 1

Open searching analysis of A, C. fetus fetus glycopeptide delta mass plot of 0.001 Da increments showing the detection of PSMs modified with masses over 1000 Da. B, Zoomed view of the C. fetus fetus glycopeptide delta mass plot highlighting the most numerous observed delta masses; red masses correspond to nonformylated glycans whereas black masses correspond to formylated glycans. C, Density and zoomed delta mass plot of the C. fetus fetus glycan masses 1243.507 Da and 1202.481 Da. D, Comparison of the unique glycopeptide sequences and glycoproteins observed between open and focused searches across C. fetus fetus data sets. Surprisingly, our open search also revealed additional glycoforms corresponding to formylated glycans (+27.99 Da) as well as a modification corresponding to the loss of a HexNAc (−203.079 Da) or Hex (−162.053 Da) from the 1243 Da or 1202 Da glycans respectively (Fig. 1B). MS/MS analysis supports these delta masses as unexpected but bona fide glycoforms (supplemental Fig. S3A to S3J). Formylated glycans have been previously observed (25, 28) during ZIC-HILIC enrichment and are most likely artifacts because of the high concentrations of formic acid (44) used during enrichment. These formylated glycans represented a significant proportion of all potential glycopeptide PSMs (Fig. 1B, supplemental Table S1). Consistent with glycan formylation being artifactual, it is not observed on C. fetus fetus glycopeptides within unenriched samples (supplemental Fig. S4). To assess the accuracy of the glycan masses obtained using open searching, we extracted the mean delta mass of the 1243 and 1202 Da glycans using a density based fitting approach (63) (Fig. 1C and 1D). We find the open search defined mass of the 1243 Da and 1202 Da glycans are both within 5 ppm of the known masses (25) supporting that this approach allows high accuracy determination of large modifications. Finally, we assessed the proteome coverage of our open database approach to a traditional search using the seven identified glycoforms (1040.423 Da, 1068.419 Da, 1202.475 Da, 1230.469 Da, 1243.501 Da, 1271.497 Da, 1299.492 Da, Fig. 1B) as a focused search (64). Consistent with previous studies focused searches outperformed open database searches (39, 46, 47) improving the identification of unique glycopeptides by 35% and glycoproteins by 28% (Fig. 1D, supplemental Table S2). This improvement was also associated with an increase in the average Byonic score from 456 to 491 for glycopeptide PSMs with identical MS/MS scans receiving an average 114 Byonic score increase between focused and open search assignments (supplemental Fig. S5). These focused searches also demonstrated that formylated glycans account for nearly a 1/3 of all glycopeptide assigned PSMs (>1246 formylated glycan PSMs out of the total 3824 glycopeptide PSMs, supplemental Table S2). Combined, these results demonstrate open searching allows the detection of heterogeneous bacterial N-linked glycopeptides without the need to define glycans before searching.

Open Database Searching Allows the Identification of Bacterial O-Linked Glycopeptides

To assess open searching's compatibility with bacterial O-linked glycopeptides, we examined glycopeptide enrichments of A. baumannii ATCC17978. The A. baumannii proteome is twice the size of C. fetus fetus (∼3600 proteins (65)) with glycosylation of both serine and threonine residues reported to date (52). Within this system, glycoproteins are modified predominantly with the glycan GlcNAc3NAcA4OAc-4-(β-GlcNAc-6-)-α-Gal-6-β-Glc-3-β-GalNAc where GlcNAc3NAcA4OAc corresponds to the bacterial sugar 2,3-diacetamido-2,3-dideoxy-α-d-glucuronic acid (glycan mass 1030.368 Da (52)). Importantly, this terminal glucuronic acid can also be found in methylated as well as un-acetylated states (corresponding to the glycan masses 1044.383 Da and 988.357 Da respectively (26, 52)). A. baumannii glycopeptide enrichments were searched allowing a wildcard mass of 200 Da to 2000 Da on serine and threonine residues. The increased complexity of this search, both in terms of the number of amino acids potentially modified as well as the size of the proteome, resulted in a marked increase in the search times per data files to ∼10 h (supplemental Fig. S1B). Within these samples, open searching readily enabled the identification of multiple delta masses of similar sizes to the expected glycans of A. baumannii ATCC17978 as well as the unexpected glycoforms of 827.281 and 1058.358 Da (Fig. 2A, supplemental Table S3). These novel glycan masses are consistent with formylation (+27.99 Da) as well as the loss of HexNAc (-203.079 Da) from the 1030 Da glycan with MS/MS analysis supporting these assignments (supplemental Fig. S6).

Fig. 2

Open searching analysis of A, A. baumannii glycopeptide delta mass plot of 0.001 Da increments showing the detection of PSMs modified with masses over 800 Da. B, Zoomed view of A. baumannii glycopeptide delta mass plot highlighting the most numerous observed modifications. C, Comparison of unique glycopeptide sequences and glycoproteins observed between open and focused searches across A. baumannii data sets D, Glycan mass plot showing the amount of glycan (in Da) observed on glycopeptides PSMs within the focused searches. E, Venn diagram showing the number of unique glycopeptide sequences grouped based on the number of glycans observed on these peptides. Examination of these masses revealed the most numerous delta masses (1029.362 Da, 987.355 Da and 1043.378 Da) were one Da less than the expected A. baumannii glycan masses (Fig. 2B (26, 52)). As with C. fetus fetus, inspection of these assignments reveals the incorrect application of the “off-by-x” parameter leading to the splitting of delta masses across multiple mass assignments separated by one Da (supplemental Fig. S7). Using the masses 1030.368 Da, 988.357 Da, 1044.383 Da, 827.281 Da and 1058.358 Da, we researched these A. baumannii data sets to assess the performance of open to focused searching. In contrast to the ∼35% increase in unique glycopeptides observed between open and focused searches in C. fetus fetus we noted a dramatic >240% improvement in unique glycopeptides identified within A. baumannii using focused searches (Fig. 2C). To understand this dramatic improvement, we examined the 67 glycopeptides unique to the focused searches. Within these glycopeptides we noted a large proportion of PSMs corresponding to glycopeptides modified with multiple glycans (Fig. 2D, supplemental Table S4). In fact, >20% (494 out of the total 2282 glycopeptide PSMs) corresponded to glycopeptides with greater than one glycan attached. Within these PSMs, 31 unique peptide sequences are only observed with >1 glycan attached (Fig. 2E). The increased numbers of unique glycopeptides identified within focused searches were also associated with an increase in the mean Byonic scores as well as identical MS/MS scans receiving an average 149 Byonic score increase between focused and open search assignments (supplemental Fig. S8). Similar to C. fetus fetus, a large proportion of glycopeptide PSMs were identified with formylated glycans (>309 out of the total 2282 glycopeptide PSMs, supplemental Table S4). It is important to note that the delta masses of multiply glycosylated peptides fall outside the 2000 Da window used for open searching making the inability to detect these glycopeptides an expected limitation of the search parameters. Thus, although open searching enables the rapid identification of glycopeptides, large glycans/multiply glycosylated peptides can be overlooked supporting the value of a two-step (open followed by focused) searching approach.

Open Database Searching Enables the Identification of Glycosylation within Large Proteomes

As open searching enabled the identification of both N and O-linked glycopeptides, we sought to explore the compatibility of this approach with larger proteomes using glycopeptide enriched samples from the bacteria B. Cenocepacia J2315 as a model. The B. Cenocepacia proteome encodes ∼7000 proteins (66) and possesses an O-linked glycosylation system responsible for modifying at least 23 proteins (37). Previously, we showed that this glycosylation system transfers two glycans composed of β-Gal-(1,3)-α-GalNAc-(1,3)-β-GalNAc and Suc-β-Gal-(1,3)-α-GalNAc-(1,3)-β-GalNAc where Suc is Succinyl with these glycans corresponding to the masses 568.211 Da and 668.228 Da respectively (23, 37). As with A. baumannii, the increased complexity of this proteome led to an increase in the search time with individual data files taking ∼20 h to process (supplemental Fig. S1C). These open searches revealed the presence of the expected glycoforms of B. cenocepacia (568.207 Da and 668.223 Da) as well as additional formylated variants (596.202 Da, 624.197, and 696.218 Da) leading to the identification of five unique glycoforms (Fig. 3A, supplemental Table S5). Unlike the large glycans of C. fetus fetus and A. baumannii, it is notable that the mono-isotopic mass of the known B. Cenocepacia glycans (37) were correctly assigned (Fig. 3A). Thus, this supports that for smaller glycans mis-assignment of the mono-isotopic masses during open searches does not appear as problematic.

Fig. 3

Open searching analysis of A, B. Cenocepacia glycopeptide delta mass plot of 0.001 Da increments showing the detection of PSMs modified with masses over 500 Da. Highlighted area shown in zoomed panel. B, Comparison of the unique glycopeptide sequences and glycoproteins observed between open and focused searches across B. Cenocepacia data sets C, Glycan mass plot showing the amount of glycan (in Da) observed on glycopeptides PSMs within the focused searches. Nearly 40% of all glycopeptide PSMs are decorated with two or more glycans. Incorporating these glycoforms into focused searches again led to a dramatic ∼4-fold increase in the number of glycopeptides and a ∼2-fold increase in the total number glycoproteins identified compared with open searches (Fig. 3B, supplemental Table S6). Although the improvement in the total number of identifications was associated with a decrease in the mean Byonic score (from 728 to 700) the assigned scores of identical MS/MS scans reveals focused searches lead to an increase in the average Byonic score of 175 compared with assignments from open searches (supplemental Fig. S9). As the dramatic improvement in the glycoproteome coverage of A. baumannii was partially driven by the detection of multiply glycosylated peptides we examined the amount of glycosylation within glycopeptide PSMs in B. Cenocepacia. As B. Cenocepacia glycopeptides modified with multiple glycans would be less than 2000 Da, we were surprised by the limited number of multiply glycosylated peptides identified within our open searches (<10% of all identified glycopeptides, supplemental Fig. S10). In contrast, focused searches identified ∼40% of all PSMs (Fig. 3C, 1508 out of 3937 identified glycopeptide PSMs) corresponded to multiply modified peptides. As with C. fetus fetus and A. baumannii formylated glycans make up nearly 50% of all glycopeptide PSMs (Fig. 3C, supplemental Table S6). These data supports that, although open searching performs well for singly modified peptides, this approach appears to underrepresent multiply glycosylated peptides even if the combined mass of the glycan is within the range of the open search.

Open Database Searching Allows the Screening of Glycan Use across Biological Samples

Having established that open searching enables the identification of a range of glycans, we sought to explore if this could also facilitate the comparison of glycan diversity across bacterial samples. Recently, we reported that a single-loci was responsible for the generation of the O-linked glycans in B. Cenocepacia and that this loci is conserved across the Burkholderia (23). Although these results support that Burkholderia species use similar glycans, it has been noted that within other bacterial genera extensive glycan heterogeneity exists (25, 26, 29, 32). As glycan heterogeneity can be challenging to predict, we reasoned that open searching would provide a means to assess the similarities in glycans used across Burkholderia species. We examined glycopeptide enrichments from eight species of Burkholderia (B. pseudomallei K96243; B. multivorans MSMB2008; B. dolosa AU0158; B. humptydooensis MSMB43; B. ubonensis MSMB22, B. anthina MSMB649; B. diffusa MSMB375; and B. pseudomultivorans MSMB2199). Examination of the delta masses observed across these eight species demonstrates that the 568 Da and 668 Da glycoforms as well as their formylated variants are present in all strains (Fig. 4A and supplemental Fig. S11, supplemental Tables S7 to S14). Having generated “delta mass fingerprints” for each species, we assessed if these profiles could enable the comparison of glycan use using Pearson correlation and hierarchical clustering (Fig. 4B and supplemental Fig. S12). Consistent with the similarities in the delta mass fingerprints Pearson correlation and hierarchical clustering resulted in the grouping of all Burkholderia species compared with the delta mass fingerprints of C. fetus fetus and A. baumannii (Fig. 4B). These results support that consistent with the conservation of the glycosylation loci within Burkholderia, the major glycoforms observed within Burkholderia species, based on mass at least, are identical. It should be noted that as with the above glycopeptide data sets, focused searches significantly improved the identification of glycopeptides and glycoproteins in all Burkholderia species (supplemental Fig. S13, supplemental Table S15 to S22).

Fig. 4

Comparison of Burkholderia glycoproteomes using open searching.A, Representative delta mass plots of four out of the eight Burkholderia strains examined demonstrating the 568 Da and 668 Da glycans are frequently identified delta masses in Burkholderia glycopeptide enrichments. Formylated glycans are denoted in black whereas Burkholderia O-linked glycans are in red. B, Pearson correlation and clustering analysis of delta mass plots enable the comparison and grouping of samples.

Open Database Searching Allows the Detection of Glycoforms Identified at a Low Frequency Based on Known Glycosylatable Peptides

In addition to allowing the comparison of glycan diversity across species, we reasoned that open searching would also allow the identification of novel glycans based on the shared use of glycosylatable peptide sequences. Within bacterial glycosylation studies, proteins compatible with different glycosylation machinery are routinely used to “fish” out glycans used for protein glycosylation in different bacterial species (26, 32). Similarly, by focusing on peptides modified with the 568/668 Da glycans we hypothesized this would provide the means to detect alternative glycans used for glycosylation within Burkholderia species. To assess this, we examined the glycopeptide enrichments of B. pseudomallei K96243 filtering for delta masses only observed on peptide sequences also modified with either the 568/668 Da glycans (Fig. 5A). Examination of these delta masses readily revealed the presence of PSMs matching the modification of peptides with single (203.077 Da) or double (406.158 Da) HexNAc residues, two 568 Da glycans (1136.422 Da) and an unexpected mass at 990.390 Da (Fig. 5A). Examination of PSMs assigned to this 990 Da delta mass revealed a linear glycan composed of HexNAc-Heptose-Heptose-188-215 where the 188 Da and 215 Da are moieties of unknown composition (Fig. 5B). Incorporation of this unexpected glycan mass into a focused search with the known Burkholderia glycans demonstrated that the 990.390 Da glycan is observed on multiple peptide substrates yet less than 6% of all glycopeptide PSMs correspond to this novel glycan (Fig. 5C). Thus, this demonstrates that open searching provides an effective means to detect unexpected glycoforms which could be overlooked because of the low frequency of their occurrence in glycoproteomic data sets.

Fig. 5

Identification of minor glycoforms within A, Delta mass plot, binned by 0.001 Da increments, showing delta masses observed for peptide sequences also modified with the 568 or 668D glycans. B, MS/MS analysis (FTMS-HCD, ITMS-CID and FTMS-EThcD) supporting the assignment of a linear glycan of HexNAc-Heptose-Heptose-188-215 attached to the peptide KAATAAPADAASQ. C, Glycan mass plot showing the amount of glycan (in Da) observed on glycopeptides PSMs within focused searches. Only ∼6% of all PSMs observed are modified with the 990 Da glycan.

DISCUSSION

MS analysis of glycoproteomic samples typically requires knowledge of both the proteome and possible glycan compositions to facilitate software-based identification (13). As bacterial glycosylation systems do not use glycans found in eukaryotic glycan databases (23, 24, 25, 26, 27, 28, 29, 30, 31, 32), we sought to establish an alternative approach for the high-throughput analysis of bacterial glycoproteomes. Within this work we demonstrate that wide mass open database searching enables the identification of bacterial glycosylation without the need to define glycan masses before searching. This approach overcomes a significant bottleneck in the identification and characterization of novel bacterial glycosylation systems. We demonstrate that a range of diverse glycan structures, both reported (25, 26, 37, 52) and unreported, such as the 990 Da glycan observed in B. pseudomallei K96243, as well as glycan artifacts such as formylated glycans can be identified using this approach. In addition, we also demonstrate that open database searches can be used to provide a simple means to compare delta mass profiles across biological samples. This enables a straightforward method to compare and contrast bacterial glycoproteomes, enabling the grouping of Burkholderia profiles from nonsimilar glycan profiles such as those seen in C. fetus fetus or A. baumannii. Within this work we used open searching within Byonic, a widely used tool in the glycoproteomic community for the analysis of glycosylation (57, 67, 68). This enabled us to directly compare the performance of open searches to focused glycopeptide searches within the same platform. We observed a marked improvement in glycopeptide and glycoprotein identifications within focused searches, especially for glycopeptides modified with multiple glycans. As several unique considerations are needed for optimal glycopeptide identification, such as accounting for glycan fragments (13, 69), this improvement in performance is unsurprising. Consistent with this we observed an increase in the Byonic scores within focused searches compared with open searches for most data sets (supplemental Figs. S5, S8, and S9). This improvement translates to an increase in the numbers of unique glycopeptides and glycoproteins identified by ∼35% to 240%. This is in line with previous studies (46, 47) and the observation that nonoptimized glycopeptide analysis can lead to a reduction in glycoproteome coverage (64). Although Byonic was used within this study it should be noted alternative noncommercial platforms such as MSfragger (43) also allow open searching. In our hands MSfragger performed comparably to Byonic for the identification of glycoforms using open searches (supplemental Figs. S14A to S14F). Yet, as with our open Byonic searches MSfragger did not identify as many unique glycopeptides/glycoproteins as focused Byonic searches (supplemental Figs. S15A to S15F). These results demonstrate that open searching can be used to identify glycopeptides, yet because of the unique challenges associated with glycopeptide identification open searches can be less sensitive than focused searches. Although we have used open searching with masses up to 2000 Da wider mass ranges can and have been used to identify glycopeptides. In fact, for eukaryotic glycosylation analysis open searches with +3000 Da have been demonstrated enabling the identification of glycans of up to ∼2000 Da in size (46, 47). Although multiple open searching tools allow the maximum delta mass range to be widened this can be at the expense of search times (48). It should be noted for open searching there is likely an upper limit for the maximum delta mass which can be used before no additional high confidence assignments are gained. This maximum limit is not only determined by the analytes being examined but also the MS/MS acquisition parameters. For example, for glycopeptides as the amount of glycan decorating peptides increases the optimal amounts of collision energy and ETD reaction times begin to diverge from standard parameters (70, 71). Thus, even using open searching some specific subsets of peptides may still be unidentifiable without careful tailoring of the data acquisition methods. At its core, this analytical approach uses a “strength in numbers” based strategy for the detection of glycans. A key strength of this approach is that it does not require the identification of unmodified versions of a peptide for the modified forms to be identified as required in dependent peptide-based approaches (49, 50). This independence of the need for unmodified peptides makes this approach compatible with enrichment strategies such as ZIC-HILIC glycopeptide enrichment. This is important as for optimal performance this approach requires large numbers of PSMs with identical delta masses. It should be noted that within ZIC-HILIC enrichments bacterial glycopeptides are still only a minor proportion of observed peptides being <10% of the identified PSMs (supplemental Fig. S16). Within this work we focused on bacterial glycosylation systems known to target multiple protein substrates (16, 18), ensuring large numbers of unique PSMs/peptide sequences would be identified. We found that, within glycopeptide enrichments, the known glycoforms of C. fetus fetus, A. baumannii and B. cenocepacia were easily detected whereas infrequently observed glycans, such as the 990 Da glycan observed within B. pseudomallei, required additional filtering to distinguish this from background signals. This supports that although open searching enables the detection of glycoforms, it is sensitive to the frequency at which modification events are observed within data sets. Although we used filtering based on glycosylatable peptide sequences to identify low frequency events, recently Kernel density estimation based fitting approaches were shown to effectively address this issue in a more general manner (63). Thus, open database searching provides multiple approaches to identify modifications even those which are poorly resolved from background. A surprising observation within our open searches was the commonality of glycan formylation events within enriched glycopeptide samples. Recently it was shown that formic acid concentrations as low as 0.1% could lead to peptide formylation (44). For the enrichment of bacterial glycopeptides ZIC-HILIC enrichment with 5% formic acid/80% acetonitrile has been extensively used (24, 26, 27, 28, 36, 37, 52, 72) yet the observation of formylation artifacts on a substantial number of glycopeptide PSMs (supplemental Tables S2, S4, S6, S15 to S22) raises concerns about this protocol. Multiple reagents including Tris (73) and alkaline solutions such as ammonium hydroxide/sodium hydroxide (74) can lead to alterations within glycan structures necessitating the judicious use of these chemicals within glycan/glycopeptide sample preparation protocols. Although 5% formic acid improves the selectivity for glycopeptides within ZIC-HILIC enrichments (75) alternative acids can also be used. Previously Mysling et al. demonstrated that formic acid could be substituted for TFA to improve the enrichment of glycopeptides (76), whereas Ding et al. noted that hydrochloric acid was an effective ion-pairing agent for normal phase enrichment of bacterial glycopeptides (77). Thus, the observation of formylation highlights that alternative buffers should be considered for future glycopeptide studies and that formylated glycans need to be considered when analyzing glycopeptide data sets where formic acid has been used during glycopeptide enrichments. Although open searching enabled the identification of glycosylation within all bacterial samples, the analysis of C. fetus fetus and A. baumannii data sets highlighted the commonality at which mono-isotopic masses of glycopeptides with large glycans (>1000 Da) are mis-assigned. This problem has been highlighted previously (62) and is not unique to glycopeptides with the mono-isotopic mass of other large biomolecules such as cross-linked peptide shown to be mis-assigned in 50 to 75% of PSMs (78). This mis-assignment of mono-isotopic masses leads to the splitting of the number of observed PSMs with a specific glycan mass across multiple mass channels. Although we demonstrate that these mis-assigned glycopeptides can be readily identified by examining the “off-by-x” parameter, it should be noted that this splitting dilutes the observable glycopeptides at a specific mass, complicating the analysis of glycoproteomes from open searches. This complication, coupled with the lower sensitivity of glycopeptide identification with open searching compared with focused searches discussed above, supports that open searching is a useful discovery tool yet typically under reports unique glycopeptides and glycoproteins within data sets. The simplest solution to this issue is to use open searching as a means to identify glycans which can then be included as variables modifications within a focused search. As highlighted above, this significantly improves glycoproteome coverage and in our hands provided the flexibility of being able to detect novel glycans yet also ensured optimal identification of glycopeptides. Automated pipelines using multi-step searching have already been demonstrated (42, 48) yet to our knowledge these have not been optimized or implemented for glycopeptide identification. Thus, we recommend a multi-step analysis to enable the identification of atypical glycosylation, using open searching to define glycans which are then incorporated into focused searches. Finally, it should be noted that although not the subject of this manuscript, the glycoproteins/glycopeptides identified in this work are themselves a useful resource for the bacterial glycosylation community. Previous studies on C. fetus fetus, A. baumannii and B. cenocepacia identified a total of 26, 26, and 23 unique glycoproteins respectively (25, 26, 37) yet the majority of these studies were undertaken on previous generations of MS instrumentation. Within this work, undertaken on a current generation instrument, we observed a marked improvement in the number of glycoproteins identified with 61 (2.3-fold), 53 (2.0-fold) and 125 (5.4-fold) glycoproteins identified in C. fetus fetus, A. baumannii and B. cenocepacia respectively. Similarly, our glycoproteomic analysis of the 8 Burkholderia species highlights that at least 70 proteins are glycosylated within each Burkholderia species. Taken together, this work highlights that the glycoproteomes of most bacterial species are likely far larger than earlier studies suggested with open searching providing an accessible starting point to probe these systems.

DATA AVAILABILITY

All MS proteomics data (Raw data files, Byonic search outputs, R Scripts and output tables) have been deposited into the PRIDE ProteomeXchange Consortium repository (59, 60) with the data set identifier: PXD018587.

82 in total

Review 1. Analysis of Mammalian O-Glycopeptides-We Have Made a Good Start, but There is a Long Way to Go.

Authors: Zsuzsanna Darula; Katalin F Medzihradszky
Journal: Mol Cell Proteomics Date: 2017-11-21 Impact factor: 5.911

2. A general protein O-glycosylation system within the Burkholderia cepacia complex is involved in motility and virulence.

Authors: Karen V Lithgow; Nichollas E Scott; Jeremy A Iwashkiw; Euan L S Thomson; Leonard J Foster; Mario F Feldman; Jonathan J Dennis
Journal: Mol Microbiol Date: 2014-03-05 Impact factor: 3.501

3. Proteomics Reveals Multiple Phenotypes Associated with N-linked Glycosylation in Campylobacter jejuni.

Authors: Joel A Cain; Ashleigh L Dale; Paula Niewold; William P Klare; Lok Man; Melanie Y White; Nichollas E Scott; Stuart J Cordwell
Journal: Mol Cell Proteomics Date: 2019-01-07 Impact factor: 5.911

4. Toward Automated N-Glycopeptide Identification in Glycoproteomics.

Authors: Ling Y Lee; Edward S X Moh; Benjamin L Parker; Marshall Bern; Nicolle H Packer; Morten Thaysen-Andersen
Journal: J Proteome Res Date: 2016-09-02 Impact factor: 4.466

5. Identification and quantification of glycoproteins using ion-pairing normal-phase liquid chromatography and mass spectrometry.

Authors: Wen Ding; Harald Nothaft; Christine M Szymanski; John Kelly
Journal: Mol Cell Proteomics Date: 2009-06-12 Impact factor: 5.911

6. Identification of a general O-linked protein glycosylation system in Acinetobacter baumannii and its role in virulence and biofilm formation.

Authors: Jeremy A Iwashkiw; Andrea Seper; Brent S Weber; Nichollas E Scott; Evgeny Vinogradov; Chad Stratilo; Bela Reiz; Stuart J Cordwell; Randy Whittal; Stefan Schild; Mario F Feldman
Journal: PLoS Pathog Date: 2012-06-07 Impact factor: 6.823

7. Complete genome sequences for 59 burkholderia isolates, both pathogenic and near neighbor.

Authors: Shannon L Johnson; Kimberly A Bishop-Lilly; Jason T Ladner; Hajnalka E Daligault; Karen W Davenport; James Jaissle; Kenneth G Frey; Galina I Koroleva; David C Bruce; Susan R Coyne; Stacey M Broomall; Po-E Li; Hazuki Teshima; Henry S Gibbons; Gustavo F Palacios; C Nicole Rosenzweig; Cassie L Redden; Yan Xu; Timothy D Minogue; Patrick S Chain
Journal: Genome Announc Date: 2015-04-30

8. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics.

Authors: Andy T Kong; Felipe V Leprevost; Dmitry M Avtonomov; Dattatreya Mellacheruvu; Alexey I Nesvizhskii
Journal: Nat Methods Date: 2017-04-10 Impact factor: 28.547

9. The PRIDE database and related tools and resources in 2019: improving support for quantification data.

Authors: Yasset Perez-Riverol; Attila Csordas; Jingwen Bai; Manuel Bernal-Llinares; Suresh Hewapathirana; Deepti J Kundu; Avinash Inuganti; Johannes Griss; Gerhard Mayer; Martin Eisenacher; Enrique Pérez; Julian Uszkoreit; Julianus Pfeuffer; Timo Sachsenberg; Sule Yilmaz; Shivani Tiwary; Jürgen Cox; Enrique Audain; Mathias Walzer; Andrew F Jarnuczak; Tobias Ternent; Alvis Brazma; Juan Antonio Vizcaíno
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

10. 2016 update of the PRIDE database and its related tools.

Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971

6 in total

1. Discovery of Unknown Posttranslational Modifications by Top-Down Mass Spectrometry.

Authors: Jesse W Wilson; Mowei Zhou
Journal: Methods Mol Biol Date: 2022

Review 2. Measuring change in glycoprotein structure.

Authors: Mary Rachel Nalehua; Joseph Zaia
Journal: Curr Opin Struct Biol Date: 2022-04-19 Impact factor: 7.786

3. Proteomic Analyses of Acinetobacter baumannii Clinical Isolates to Identify Drug Resistant Mechanism.

Authors: Ping Wang; Ren-Qing Li; Lei Wang; Wen-Tao Yang; Qing-Hua Zou; Di Xiao
Journal: Front Cell Infect Microbiol Date: 2021-02-24 Impact factor: 5.293

4. Burkholderia PglL enzymes are Serine preferring oligosaccharyltransferases which target conserved proteins across the Burkholderia genus.

Authors: Andrew J Hayes; Jessica M Lewis; Mark R Davies; Nichollas E Scott
Journal: Commun Biol Date: 2021-09-07

5. Glyco-Decipher enables glycan database-independent peptide matching and in-depth characterization of site-specific N-glycosylation.

Authors: Zheng Fang; Hongqiang Qin; Jiawei Mao; Zhongyu Wang; Na Zhang; Yan Wang; Luyao Liu; Yongzhan Nie; Mingming Dong; Mingliang Ye
Journal: Nat Commun Date: 2022-04-07 Impact factor: 14.919

Review 6. A Pragmatic Guide to Enrichment Strategies for Mass Spectrometry-Based Glycoproteomics.

Authors: Nicholas M Riley; Carolyn R Bertozzi; Sharon J Pitteri
Journal: Mol Cell Proteomics Date: 2020-12-20 Impact factor: 5.911

6 in total