
ssPINE: Probabilistic Algorithm for Automated Chemical Shift Assignment of Solid-State NMR Data from Complex Protein Systems.

Adilakshmi Dwarasala¹, Mehdi Rahimi¹, John L Markley¹,², Woonghee Lee¹.

Abstract

The heightened dipolar interactions in solids render solid-state NMR (ssNMR) spectra more difficult to interpret than solution NMR spectra. On the other hand, ssNMR does not suffer from severe molecular weight limitations like solution NMR. In recent years, ssNMR has undergone rapid technological developments that have enabled structure-function studies of increasingly larger biomolecules, including membrane proteins. Current methodology includes stable isotope labeling schemes, non-uniform sampling with spectral reconstruction, faster magic angle spinning, and innovative pulse sequences that capture different types of interactions among spins. However, computational tools for the analysis of complex ssNMR data from membrane proteins and other challenging protein systems have lagged behind those for solution NMR. Before a structure can be determined, thousands of signals from individual types of multidimensional ssNMR spectra of samples, which may have differing isotopic composition, must be recognized, correlated, categorized, and eventually assigned to atoms in the chemical structure. To address these tedious steps, we have developed an automated algorithm for ssNMR spectra called "ssPINE". The ssPINE software accepts the sequence of the protein plus peak lists from a variety of ssNMR experiments as inputs and offers automated backbone and side-chain assignments. The alpha version of ssPINE, which we describe here, is freely available through a web submission form.


Keywords: MAS-NMR; assignment; automation; membrane proteins; solid-state NMR; ssPINE

Year: 2022 PMID: 36135853 PMCID: PMC9503581 DOI: 10.3390/membranes12090834

Source DB: PubMed Journal: Membranes (Basel) ISSN: 2077-0375

1. Introduction

NMR spectroscopy is one of the major biophysical methods, along with X-ray crystallography [1,2] and cryo-electron microscopy [3], for determining structures of biomolecules. NMR is used to study structure–function relationships of membrane proteins and large macromolecular assemblies [4] along with their interactions with small molecules [5] as an approach to drug discovery [6]. Both solution and solid-state NMR techniques provide important information about the structures and dynamics of membrane proteins [7,8]. Solid-state NMR (ssNMR) with magic angle spinning (MAS) has advantages over solution NMR for studies of large and immobilized proteins [9,10]. Anisotropic nuclear spin interaction information from ssNMR can be extremely useful for structure determination and dynamics [11,12]. The orientation of regions of membrane proteins can be extracted from ssNMR spectra of mechanically- or magnetically-aligned membranes [13]. The broad lines and low resolution of ssNMR spectra resulting from anisotropy can be overcome in part by ultra-high MAS, cross-polarization, refined pulse sequences [14], and non-uniform sampling (NUS). Ultra-high-field NMR spectrometers operating at 1.1 GHz and 1.2 GHz are improving the resolution and sensitivity of ssNMR spectra of membrane proteins and their complexes. The above-mentioned methods are enabling the collection of improved spectral data, but manual analysis of the data to obtain chemical shift assignments and structural constraints is tedious because thousands of signals need to be analyzed, correlated, and labeled. Software technology has reduced the burden of analyzing data from solution NMR studies of biomolecules. Available web-based resources provide automated and semi-automated algorithms for determining different parameters of biomolecules and their structure [15,16,17,18]. 
We recently developed an updated version of the assignment engine PINE [19], I-PINE (Integrative Probabilistic Interaction Network of Evidence) [20], which utilizes a Bayesian-based probabilistic interaction network. I-PINE supports a larger range of NMR experiments and integrates real-time statistical analysis of the PACSY database [21]. The I-PINE web server produces higher assignment coverage and accuracy than PINE and supports structure determinations based on chemical shift assignments. The POKY suite includes iPick [22], for peak picking and cross-validation of peaks from different spectra, I-PINE, and PINE-SPARKY.2 [23], a user-friendly graphical user interface (GUI) for submitting, importing, and validating the data [24]. For ssNMR data, PISA-SPARKY [25], a plugin for the assignment program NMRFAM-SPARKY [26], supports the analysis of data from oriented samples [27]. PISA-SPARKY, along with its features, is now included in the POKY suite. Recently, the Veglia group introduced “a one-shot approach” called PHRONESIS, which generates up to ten 3D 1H-detected ssNMR spectra [28]. They used the I-PINE web server to analyze the spectra and found that the yield of sequential assignments was similar to that for solution NMR data. The Hunter Moseley and Chad Rienstra groups developed an ssNMR version of AutoAssign and demonstrated its ability to assign ssNMR data from the small protein GB1 [29]. The software returned 84.1% correct assignments. The ssFLYA algorithm, which was introduced by Schmidt and colleagues [30] and is currently available only to commercial users, yielded 88–87% and 77–90% correctness on protein microcrystals and amyloids. Here, we describe ssPINE (solid-state PINE), a software package that is designed to handle the challenging features of ssNMR data from membrane proteins and other complex protein systems. 
ssPINE accepts, as inputs, 2D and 3D ssNMR data and gives, as an output, chemical shift assignments and their probabilistic correctness. We have evaluated the performance of ssPINE with data from GB1 and with additional protein NMR data from the BMRB database [31]. The alpha version of ssPINE is freely available through a web server utility at https://poky.clas.ucdenver.edu/ssPINE.

2. Materials and Methods

2.1. ssPINE Algorithm

As its first step, ssPINE generates spin system matrices [32], as shown in Figure 1. The main difference between the I-PINE and ssPINE algorithms is in their approach to comparing peaks from different experiments. I-PINE uses N and H in root experiments to find correlated signals (CA/CB/CO−1, CA/CB/CO) in different experiments and to establish di-peptide arrays {CA/CB/CO−1 N CA/CB/CO}. Then, it establishes a vector [CA−1, CB−1, CO−1, N, CA, CB, CO] and compares it to [CA−1, CB−1, CO−1] and [CA, CB, CO] from the other di-peptide arrays. Finally, I-PINE compares [CA−1, CB−1, CO−1] to [CA, CB, CO] in all vectors. By contrast, ssPINE uses CO−1, N, and CA in root experiments to find di-peptide signals (CX−1, CX). If no single experiment (e.g., CANCO) provides CO−1, N, and CA in one peak, ssPINE combines information from different experiments, such as NCOCACB and NCACB, to obtain these correlations. Unlike I-PINE, ssPINE generates each spin system matrix by iterating over spectral resolution steps (tolerances), using inter-residue connectivities optimized by a probability approach. This is a basic and important component of the ssNMR algorithm because it helps to overcome the variable spectral resolution of ssNMR spectra. ssPINE calculates the quality of the data at each step until it reaches the point where there is no further improvement in the spin system matrix. If the quality of the data is above a threshold value, as determined by comparing the numbers of spin systems and inter-spin-system correlations identified with the numbers expected, then the process continues to the pentapeptide generation step. Otherwise, the process terminates and informs the user that more information is required. The pentapeptide generation step, which assigns signals to atoms in sequences of five amino acid residues, finds the best marginal probabilities by using the belief propagation algorithm [33] to evaluate relations between spin systems. 
This step includes the identification of secondary structural elements, the evaluation of possible referencing errors, and the continued assignment of backbone spin systems until convergence is reached or, alternatively, until the specified number of iterations has occurred. The last step utilizes the Bayesian network model of PINE and I-PINE to assign side chain signals [16,20]. See the Supporting Information for a detailed description of the ssPINE algorithm.
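The belief propagation step can be illustrated with a generic sum-product (forward-backward) pass over a chain of sequence positions. This is only a minimal sketch and not the ssPINE implementation: the chain topology, the function name `chain_marginals`, and the `unary` and `pairwise` score tables are all assumptions made for illustration. Each position has several candidate spin systems; unary scores stand in for how well a candidate fits the residue type, and pairwise scores stand in for how well neighbouring candidates connect.

```python
from typing import List


def chain_marginals(unary: List[List[float]],
                    pairwise: List[List[List[float]]]) -> List[List[float]]:
    """Sum-product marginals on a chain (hypothetical illustration).

    unary[t][k]       non-negative score of candidate k at position t
    pairwise[t][j][k] compatibility of candidate j at t with candidate k at t+1
    """
    n = len(unary)
    # Forward messages: evidence arriving from positions before t.
    fwd = [[1.0] * len(unary[t]) for t in range(n)]
    for t in range(1, n):
        for k in range(len(unary[t])):
            fwd[t][k] = sum(
                fwd[t - 1][j] * unary[t - 1][j] * pairwise[t - 1][j][k]
                for j in range(len(unary[t - 1]))
            )
    # Backward messages: evidence arriving from positions after t.
    bwd = [[1.0] * len(unary[t]) for t in range(n)]
    for t in range(n - 2, -1, -1):
        for j in range(len(unary[t])):
            bwd[t][j] = sum(
                pairwise[t][j][k] * unary[t + 1][k] * bwd[t + 1][k]
                for k in range(len(unary[t + 1]))
            )
    # Belief = local score times both incoming messages, normalized per position.
    marginals = []
    for t in range(n):
        belief = [fwd[t][k] * unary[t][k] * bwd[t][k] for k in range(len(unary[t]))]
        total = sum(belief)
        marginals.append([b / total for b in belief])
    return marginals


if __name__ == "__main__":
    unary = [[0.8, 0.2], [0.5, 0.5]]
    pairwise = [[[0.9, 0.1], [0.1, 0.9]]]
    print(chain_marginals(unary, pairwise))
```

In this toy case the second position has no preference of its own, yet its marginal shifts toward the first candidate because that candidate connects well to the favoured neighbour. On a chain the pass is exact; on graphs with loops, belief propagation becomes an approximation.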

Figure 1

Spin system matrix assembly in ssPINE. (a) Peaks from the strip of the NCOCACB experiment containing the CA(i − 1) and CB(i − 1) resonances are inserted in the row of the table corresponding to the ith residue (IDX). Note that the peak is selected using the CO(i − 1) and N(i) shifts from the root information. (b) Similarly, peaks from the strip of the NCACB experiment containing the CA(i) and CB(i) resonances are inserted in the row of the table corresponding to the ith residue. The N(i) and CA(i) shifts from the root information are used to select a peak from NCACB. (c) The process is repeated for all residues in the peptide sequence.
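The row-filling procedure in Figure 1, combined with the stepwise widening of tolerances described in Section 2.1, might be sketched as follows. The peak tuple layouts, the tolerance schedule, the stopping rule, and the names `build_rows` and `assemble` are assumptions for illustration; the actual ssPINE quality measure and probability-based optimization are not reproduced here.

```python
from typing import Dict, List, Tuple

Peak3D = Tuple[float, float, float]


def close(a: float, b: float, tol: float) -> bool:
    return abs(a - b) <= tol


def build_rows(roots: List[Peak3D], ncocx: List[Peak3D], ncacx: List[Peak3D],
               tol_n: float, tol_c: float) -> List[Dict]:
    """One spin-system row per root peak (CA(i), N(i), CO(i-1))."""
    rows = []
    for ca, n, co_prev in roots:
        # Inter-residue carbons: NCOCX-type peaks (N(i), CO(i-1), CX(i-1)).
        prev_cx = sorted(cx for pn, pco, cx in ncocx
                         if close(pn, n, tol_n) and close(pco, co_prev, tol_c))
        # Intra-residue carbons: NCACX-type peaks (N(i), CA(i), CX(i)).
        own_cx = sorted(cx for pn, pca, cx in ncacx
                        if close(pn, n, tol_n) and close(pca, ca, tol_c))
        rows.append({"N": n, "CA": ca, "CO-1": co_prev,
                     "CX-1": prev_cx, "CX": own_cx})
    return rows


def assemble(roots, ncocx, ncacx, tolerances):
    """Widen tolerances step by step; stop when no more rows become complete."""
    best_rows, best_score = [], -1
    for tol_n, tol_c in tolerances:
        rows = build_rows(roots, ncocx, ncacx, tol_n, tol_c)
        score = sum(1 for r in rows if r["CX-1"] and r["CX"])
        if score <= best_score:
            break  # no further improvement at this wider tolerance
        best_rows, best_score = rows, score
    return best_rows


if __name__ == "__main__":
    roots = [(56.0, 120.0, 175.0)]
    ncocx = [(120.05, 175.1, 54.0), (120.05, 175.1, 30.0)]
    ncacx = [(119.9, 56.2, 56.2), (119.9, 56.2, 40.0)]
    steps = [(0.02, 0.05), (0.2, 0.3), (0.5, 0.5)]  # ppm, hypothetical values
    print(assemble(roots, ncocx, ncacx, steps))
```

Starting tight and loosening only while the matrix improves keeps well-resolved regions unambiguous while still recovering rows in crowded or broad regions, which is the motivation the text gives for iterating over tolerances.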

2.2. Input Files

As with PINE and I-PINE, peak lists (either raw or refined) and sequence files are used as inputs to ssPINE. The supported solid-state NMR experiments and their profiles are shown in Table 1. The minimum set of peak lists for assignments are those from 2D-CC, 2D-NCA, 2D-NCO, 3D-NCACX, 3D-NCOCX, and CAN(CO)CX ssNMR experiments. Data from additional ssNMR experiments can be added to improve the accuracy and completeness of the results.

Table 1

ssNMR experiments supported by ssPINE with their dimensionality and connectivity profiles. CX(i) represents carbon A, B, D, E, G, or H atoms of the ith residue; N(i) represents the nitrogen atom of the ith residue; and CO(i − 1) represents the carbonyl carbon atom of the preceding residue. The minimum set of experiments needed is indicated by asterisks.

Experiment	Dimension	Profile
CC *	2D	CX/O(i)-CX/O(i)
NCA *	2D	N(i)-CA(i)
NCACB	2D	N(i)-CA/B(i)
NCO *	2D	N(i)-CO(i − 1)
NCACO	3D	N(i)-CA(i)-CO(i)
NCACB	3D	N(i)-CA(i)-CA/B(i)
NCACX *	3D	N(i)-CA(i)-CX(i)
NCOCX *	3D	N(i)-CO(i − 1)-CX/C(i − 1)
NCOCA	3D	N(i)-CO(i − 1)-CA(i − 1)
NCOCACB	3D	N(i)-CO(i − 1)-CA/B(i − 1)
CANCO	3D	CA(i)-N(i)-CO(i − 1)
CANCOCX *	3D	CA(i)-N(i)-CX/O(i − 1)
CANCOCA	3D	CA(i)-N(i)-CA/O(i − 1)
CANCOCACB	3D	CA(i)-N(i)-CO/A/B(i − 1)

* Minimum experiments to run ssPINE.
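Before submitting a job, it can be useful to check that the peak lists on hand cover the starred experiments in Table 1. The helper below is a small sketch for that purpose; the function name and the name-normalization rules (dropping the 2D-/3D- prefix and parentheses) are assumptions, and the web form itself may validate inputs differently.

```python
# Starred experiments from Table 1.
MINIMUM_EXPERIMENTS = {"CC", "NCA", "NCO", "NCACX", "NCOCX", "CANCOCX"}


def missing_minimum(provided):
    """Return the minimum-set experiments that are absent from `provided`."""
    normalized = set()
    for name in provided:
        key = name.upper().replace("2D-", "").replace("3D-", "")
        key = key.replace("(", "").replace(")", "")  # CAN(CO)CX -> CANCOCX
        normalized.add(key)
    return sorted(MINIMUM_EXPERIMENTS - normalized)


if __name__ == "__main__":
    print(missing_minimum(["2D-CC", "2D-NCA", "3D-NCACX"]))
```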

2.2.1. Preparation of Peak Lists

Several peak list formats are accepted: Sparky (UCSF-/NMRFAM-SPARKY or POKY) with the .list file extension prepared in the peak list window (two-letter-code “lt” with the Data Heights option turned on), XEASY with the .peaks file extension [34], nmrDraw with the .ft2 file extension, NMRView with the .xpk file extension, and I-PINE with the .txt file extension. The file extension in the file name should match its actual format. Other programs can generate the Sparky format, which is one of the most common file formats in the field. For example, CARA has the WriteSparkyPeakList.lua script, and CCPNMR v2 has the Format Converter program [35,36]. The POKY suite contains multiple options for generating peak lists; of these, one of the easiest approaches is iPick. With iPick, the user simply selects one or more spectra from the session and clicks on the “Run iPick” button. After peak lists have been generated for each spectrum, the “Peak List” window opens, and by clicking on the “Save” button, the user can designate the names for the peak lists. Peak lists can be refined by hand or by software to remove noise or other spurious peaks.
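For readers who want to inspect or filter a Sparky-format peak list before submission, a minimal reader might look like the sketch below. It assumes the commonly seen layout of a header row (`Assignment  w1  w2 ...  Data Height`) followed by one whitespace-separated peak per line; real files can carry extra columns, so the function name and the parsing rules here are illustrative only.

```python
def parse_sparky_list(text):
    """Parse Sparky-style peak list text into a list of dictionaries."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Count the w1, w2, ... columns to learn the dimensionality.
    ndim = sum(1 for tok in header
               if tok.lower().startswith("w") and tok[1:].isdigit())
    peaks = []
    for ln in lines[1:]:
        fields = ln.split()
        shifts = tuple(float(x) for x in fields[1:1 + ndim])
        # The height column is present when "Data Heights" was enabled.
        height = float(fields[1 + ndim]) if len(fields) > 1 + ndim else None
        peaks.append({"label": fields[0], "shifts": shifts, "height": height})
    return peaks


if __name__ == "__main__":
    sample = """
      Assignment         w1         w2         w3   Data Height

           ?-?-?    120.123     56.789     30.123      123456
           ?-?-?    118.500     60.250     38.700       98765
    """
    for peak in parse_sparky_list(sample):
        print(peak)
```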

2.2.2. Protein Sequence

ssPINE accepts peptide sequences in either one- or three-letter amino acid codes as ASCII text files. Sequences submitted in RTF (Rich Text Format; .rtf), ODT (OpenDocument Text; .odt), or DOCX (Office Open XML; .docx) are automatically converted to ASCII.
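Because both one- and three-letter codes are accepted, a user-side normalization step can help catch typos before upload. The sketch below converts either form to a one-letter string for the 20 standard residues; the function name and the rule for telling the two forms apart are assumptions, not a description of how the server parses sequences.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(text):
    """Normalize a sequence given in one- or three-letter codes."""
    tokens = text.upper().split()
    # Treat the input as three-letter codes only if every token is one.
    if tokens and all(tok in THREE_TO_ONE for tok in tokens):
        return "".join(THREE_TO_ONE[tok] for tok in tokens)
    sequence = "".join(tokens)
    unknown = sorted(set(sequence) - set(THREE_TO_ONE.values()))
    if unknown:
        raise ValueError("unrecognized residue codes: " + ", ".join(unknown))
    return sequence


if __name__ == "__main__":
    print(to_one_letter("Met Thr Tyr Lys"))
    print(to_one_letter("MTYK\nLILN"))
```

One caveat of this heuristic: a one-letter sequence that happens to spell a three-letter code on its own (for example a tripeptide written as `ALA`) would be read as a single residue.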

2.3. Output Files

The ssPINE output consists of several files: (1) The list of ssNMR experiments used. (2) A bar graph indicating the assignment probabilities of each residue in the protein (Figure 2). (3) Separate files with the backbone and sidechain chemical shift assignments of each residue of the protein in NMR-STAR 2.1 and 3.1 formats. (4) Sparky format assignment labels and frequencies. (5) Protein secondary structure prediction by PECAN (Protein Energetic Conformational Analysis from NMR chemical shifts) [37]. (6) Chemical shift referencing errors in each experiment, as detected by LACS (Linear Analysis of Chemical Shifts) [38], which is used in redefining offsets during the assignment iteration. (It is recommended that the user use these values to correct the offset for each peak list when a job is resubmitted. This will reduce the computational time and improve the assignment accuracy).

Figure 2

Bar graphs indicating the correct assignment probability (p) for each residue of GB1 resulting from ssPINE analysis. Green indicates p greater than 0.99; cyan indicates p = 0.85–0.99; yellow indicates p = 0.5–0.84; red indicates p less than 0.5; and gray indicates no assignment (not seen with these test sets). (a) Unrefined GB1 data as input. (b) Refined GB1 data as input.
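The color scheme in Figure 2 amounts to a simple binning of the assignment probability. A sketch of that mapping is shown below; the function name is hypothetical, and the treatment of values that fall between the stated ranges (for example 0.845) as belonging to the lower bin is an assumption, since the caption gives the ranges only to two decimal places.

```python
def probability_color(p):
    """Map an assignment probability to the Figure 2 color classes."""
    if p is None:
        return "gray"    # no assignment
    if p > 0.99:
        return "green"
    if p >= 0.85:
        return "cyan"
    if p >= 0.5:
        return "yellow"
    return "red"


if __name__ == "__main__":
    for value in (0.995, 0.9, 0.6, 0.3, None):
        print(value, probability_color(value))
```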

2.4. Data Used in Developing ssPINE

2.4.1. Data from GB1

In the early stages of developing ssPINE, we used unpublished ssNMR data from the uniformly 13C/15N-labeled small (56 residue, 6.2 kDa) protein GB1 that was generously provided by Chad Rienstra’s group. GB1, which is the streptococcal B1 immunoglobulin-binding domain of protein G20, has been used frequently as a standard sample in the development of NMR technology. We prepared both unrefined and refined peak lists from raw data from the following ssNMR experiments: 2D-CC, 2D-NCA, 2D-NCACB, 2D-NCO, 3D-NCACB, 3D-NCACX, 3D-NCACO, 3D-CANCO, 3D-CANCOCX, 3D-NCOCA, 3D-NCOCACB, and 3D-NCOCX. We prepared unrefined peak lists automatically with the iPick peak picking tool of POKY (two-letter-code iP). Subsequently, we created refined peak lists by using the cross-validation tool of iPick to weed out noise and non-sequential signals.

2.4.2. Other Protein NMR Data

Additional data from the PACSY database [21] were used in refining and optimizing the ssPINE algorithm. PACSY is a relational database that contains post-processed information from BMRB [39] and PDB [40]. Data from 82 proteins, including both large (181 residues) and small (26 residues) proteins, were included (SI Table S1). Most datasets were from solution NMR, because few ssNMR entries in the BMRB contain complete assignments. We created synthetic peak lists for 2D-CC, 2D-NCA, 2D-NCACB, 2D-NCO, 3D-NCACB, 3D-NCACX, 3D-NCACO, 3D-CANCO, 3D-CANCOCX, 3D-NCOCA, 3D-NCOCACB, and 3D-NCOCX ssNMR spectra of these proteins. For a more controlled evaluation, we only regarded sequential cross-peaks.

2.5. ssPINE Web Server

We utilized multiple technologies in implementing the ssPINE algorithm as a web server. Programs written in Perl, Python, and shell scripting handle various parts of the task. A web-facing server hosts a form that the user can fill out with their information: the amino acid sequence file and the peak lists from specified 2D and 3D solid-state NMR experiments. By clicking the “Submit” button, this information is validated and sent to a processing server. After the automated backbone and sidechain assignments are completed, the result is sent back to the user’s email address. From there, the user can download all the result files. The actual running time is determined by the size of the protein and the complexity of the problem, including peak list quality provided by the user, but jobs usually require less than one hour. The ssPINE web server is hosted at the University of Colorado, Denver and is accessible at: https://poky.clas.ucdenver.edu/ssPINE. No login or signup is required, and the server is open to all researchers at no cost and processes submissions in the order in which they are received.

3. Results

We evaluated the results with GB1 in terms of their completeness and correctness. “Completeness” is the number of chemical shifts automatically assigned by ssPINE divided by the number of assignments for GB1 derived from our manual assignment of the ssNMR data. “Correctness” is the number of correct assignments made by ssPINE divided by the number of manual assignments. Given that ssPINE provides multiple assignment candidates with associated probabilities, only the assignment candidate with the highest probability is used in the evaluation of completeness and correctness. With the unrefined peak lists of GB1 as inputs, ssPINE yielded 100% (219/219) completeness and 97.26% (213/219) correctness for the backbone chemical shift assignments (Figure 2a). With the refined peak lists of GB1, ssPINE yielded 100% (219/219) completeness and 100% (219/219) correctness (Figure 2b). We also tested the ssPINE algorithm with synthetic peak lists from other proteins whose assigned chemical shifts had been deposited in BMRB (see Section 2.4.2). These BMRB assignments are assumed to be correct and were used in evaluating the correctness of the ssPINE results. The numbers of BMRB and ssPINE assignments were used, respectively, as the denominator and numerator in the completeness calculation. The number of valid ssPINE assignments (“given” assignments) at each probability cutoff was used as the denominator in the correctness calculation. The total numbers of assignment candidates returned by ssPINE are plotted as a function of their probability scores in Figure 3a. They are shown as “correct”, “incorrect”, “given” (sum of correct and incorrect), and “all”. The “all” category includes “given” plus invalid assignments, namely those with scores below the probability cutoff.
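The GB1 metrics defined above can be written out as a short calculation. In this sketch, both completeness and correctness use the number of manual assignments as the denominator, as stated for GB1; the function name, the dictionary layout, and the 0.5 ppm tolerance used to call an automated shift "correct" are assumptions for illustration.

```python
def evaluate_gb1_style(auto, manual, tol=0.5):
    """Completeness and correctness (%) relative to the manual assignments.

    auto, manual: dicts mapping an atom label to a chemical shift in ppm.
    tol: hypothetical agreement tolerance in ppm.
    """
    assigned = [atom for atom in manual if atom in auto]
    correct = [atom for atom in assigned
               if abs(auto[atom] - manual[atom]) <= tol]
    completeness = 100.0 * len(assigned) / len(manual)
    correctness = 100.0 * len(correct) / len(manual)
    return completeness, correctness


if __name__ == "__main__":
    manual = {"M1CA": 54.2, "T2CA": 61.5, "Y3CA": 57.0, "K4CA": 55.1}
    auto = {"M1CA": 54.3, "T2CA": 61.4, "Y3CA": 45.0, "K4CA": 55.0}
    print(evaluate_gb1_style(auto, manual))
    # The reported unrefined GB1 figure follows from the same ratio:
    print(round(100.0 * 213 / 219, 2))
```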

Figure 3

Results from ssPINE analysis of synthetic ssNMR data as averages for the 82 proteins studied. (a) Chemical shift assignment probabilities returned by ssPINE for all assignment candidates (x-axis) versus assignment type (y-axis). All (dashed black), given (dashed blue), and correct (solid green) assignments are represented by the numbers on the left side, whereas the incorrect assignments (solid red) are represented by the numbers on the right side. (b) Data from the assignment candidate for each protein with the highest assignment probability. Completeness (solid blue) and correctness (solid green) are plotted as a function of that assignment probability.

The correctness and completeness parameters for all assignment candidates with the highest probability for each protein are plotted with respect to their probability in Figure 3b. The correctness decreased moderately with decreasing probability. The fact that it remained above 85% means that more than 85% of the given chemical shift values were assigned correctly. Overall, the completeness ranged between 85% and 97%. The completeness increased abruptly as the probability decreased from 1.0 to 0.9, and then more gradually down to 0.0. Plots of percentages of completeness versus correctness for each BMRB entry at each probability are given in SI Figure S2. The unrefined GB1 peak lists led to a few incorrect 13Cα assignments (Figure 2a) because false signals picked by the automated peak-picking algorithm were close to the BMRB average chemical shift value. Manual refinement of the peak lists alleviated this problem by removing false-positive peaks, adding unpicked peaks, and resolving overlaps. Of the 82 synthetic sets of peak lists analyzed by ssPINE, only three yielded assignment correctness below 70% with a probability cutoff of 0.5. These are denoted by red circles in SI Figure S2 and by red text in SI Table S1. One of the poorest scoring datasets (completeness = 84.5% (474/561); correctness = 66.9% (317/474)) corresponded to BMRB entry 15716 (the AlgE6R1 subunit from the Azotobacter vinelandii mannuronan C5-epimerase), a 153-residue protein containing 27 glycine residues with many overlapping peaks in the Cα region (~45 ppm).
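For the synthetic datasets the denominators differ from the GB1 case: completeness is the number of given assignments over the number of BMRB assignments, and correctness is the number of correct assignments over the number given at that cutoff. A sketch of the calculation at a single cutoff follows; the function name and the `(probability, is_correct)` candidate layout are assumptions.

```python
def score_at_cutoff(candidates, n_reference, cutoff):
    """Completeness and correctness (%) at one probability cutoff.

    candidates: list of (probability, is_correct) for the top candidate
                of each atom; n_reference: number of BMRB assignments.
    """
    given = [is_correct for prob, is_correct in candidates if prob >= cutoff]
    completeness = 100.0 * len(given) / n_reference
    # Correctness is undefined when nothing passes the cutoff.
    correctness = 100.0 * sum(given) / len(given) if given else None
    return completeness, correctness


if __name__ == "__main__":
    toy = [(0.99, True), (0.9, True), (0.7, False), (0.4, True)]
    for cutoff in (0.95, 0.5):
        print(cutoff, score_at_cutoff(toy, n_reference=5, cutoff=cutoff))
    # The figures quoted for BMRB entry 15716 follow from the same ratios:
    print(round(100.0 * 474 / 561, 1), round(100.0 * 317 / 474, 1))
```

Sweeping the cutoff with such a function reproduces the trade-off visible in Figure 3b: lowering the cutoff admits more assignments (higher completeness) at the cost of a larger share of incorrect ones.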

4. Discussion

In this report, we have introduced the ssPINE algorithm for the automated analysis and assignment of solid-state NMR data from membrane proteins and other difficult protein systems. ssPINE builds on the technology of our I-PINE web server for solution NMR data, which serves several thousand jobs annually. We have adapted the I-PINE algorithm to account for the challenging features of ssNMR data from these systems. These include broader lines, extensive inter-residue dipolar interactions, and 2D/3D ssNMR experiments that yield a variety of connectivities. As with I-PINE, ssPINE accepts as input the amino acid sequence of the protein and raw or refined peak lists from a variety of NMR experiments (Table 1). The output of ssPINE includes peak assignments and their probabilities. We have tested and refined the implementation of the ssPINE algorithm with the excellent set of ssNMR data from the small protein GB1. We also used as input to ssPINE a set of synthetic peak lists that simulated ssNMR data from 82 other proteins of various sizes; these peak lists were generated from solution NMR data deposited in BMRB. As shown above, the choice of probability cutoff is an important factor in maximizing correct assignments. In solution NMR, the recommended probability cutoff for I-PINE is 0.5 because it leads to a higher probability of correct assignments [20]. With ssNMR data, a cutoff of 0.6 appears to provide optimal completeness and assignment correctness. Glycine residues are harder to assign because they lack the CB signals that ssPINE uses to evaluate connectivities. Proteins with a high glycine content (e.g., BMRB entry 15716) are particularly problematic because ssPINE has difficulty distinguishing among the several glycine candidates. Currently, the user can use the ssPINE extension in POKY (two-letter-code EP) to generate and submit peak lists from the web browser to the ssPINE web server. 
The user can use the “Convert (ss)I-PINE outputs to POKY” plugin in POKY (two-letter-code ip) to convert the assigned chemical shift table file from ssPINE to the POKY resonance list file with the chosen probability cutoff. Finally, the POKY Notepad (two-letter-code Pn) can be used to propagate assigned peaks onto ssNMR spectra: this is enabled by the script “Simulate SSNMR peaks with assignments labels (predict-and-confirm)”. The analysis of ssNMR data from membrane proteins is highly challenging. ssPINE offers a promising approach for resolving the chemical, structural, and dynamic information contained in these spectra. Information of this kind is crucial for understanding the mechanisms underlying membrane transport, energy transfer, and signaling. We encourage feedback from users of ssPINE, particularly those analyzing ssNMR spectra of membrane proteins, as a means for guiding its further development. Our immediate goals with ssPINE are to incorporate information from strategies commonly used in NMR spectroscopy of membrane proteins, including mutational analysis, 19F labeling, and/or selective isotopic labeling. Longer-term plans are to develop and release a program (ssPINE-POKY) that will include a graphical user interface analogous to that in PINE-SPARKY.2 for solution NMR. In addition, we envision an “integrative” version of ssPINE that will increase assignment correctness and completeness by implementing adaptive probability density functions that incorporate machine learning (ML)-based chemical shift and structure prediction methods, and that will provide a comprehensive visualization of structural and dynamic information from ssNMR data, analogous to that afforded by I-PINE for solution NMR data.
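The effect of choosing a probability cutoff when converting ssPINE output to a resonance list can be illustrated with a small filter that keeps, for each atom, only the highest-probability candidate and only if it reaches the cutoff. This is a sketch of the idea rather than the POKY plugin's code; the function name, the candidate tuple layout, and the atom labels are hypothetical, and the default of 0.6 follows the value suggested above for ssNMR data.

```python
def select_assignments(candidates, cutoff=0.6):
    """Keep the best candidate per atom if its probability meets the cutoff.

    candidates: iterable of (atom_label, shift_ppm, probability).
    Returns a dict mapping atom_label to (shift_ppm, probability).
    """
    best = {}
    for atom, shift, prob in candidates:
        if prob < cutoff:
            continue
        if atom not in best or prob > best[atom][1]:
            best[atom] = (shift, prob)
    return best


if __name__ == "__main__":
    candidates = [
        ("A24CA", 54.5, 0.95),
        ("A24CA", 52.1, 0.05),
        ("G9CA", 45.2, 0.55),
    ]
    print(select_assignments(candidates))              # default cutoff 0.6
    print(select_assignments(candidates, cutoff=0.5))  # I-PINE-style cutoff
```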

5. Web Server Availability

The usage of the webserver is described in Section 2.5. The web server for ssPINE is freely accessible at https://poky.clas.ucdenver.edu/ssPINE.
