
Accelerating prediction of chemical shift of protein structures on GPUs: Using OpenACC.

Eric Wright¹, Mauricio H Ferrato¹, Alexander J Bryer², Robert Searles¹, Juan R Perilla², Sunita Chandrasekaran¹.

Abstract

Experimental chemical shifts (CS) from solution and solid-state magic-angle-spinning nuclear magnetic resonance (NMR) spectra provide atomic-level information for each amino acid within a protein or protein complex. However, structure determination of large complexes and assemblies based on NMR data alone remains challenging due to the complexity of the calculations. Here, we present a hardware-accelerated strategy for the estimation of NMR chemical shifts of large macromolecular complexes based on the previously published PPM_One software. The original code was not viable for computing large complexes, with our largest dataset taking approximately 14 hours to complete. Our results show that serial code refactoring and parallel acceleration reduced the runtime of the software on an NVIDIA Volta V100 Graphics Processing Unit (GPU) to 46.71 seconds for our largest dataset of 11.3 million atoms. We use OpenACC, a directive-based programming model, to port the application to a heterogeneous system consisting of x86 processors and NVIDIA GPUs. Finally, we demonstrate the feasibility of our approach in systems of increasing complexity ranging from 100K to 11.3M atoms.


Year: 2020 PMID: 32401799 PMCID: PMC7250467 DOI: 10.1371/journal.pcbi.1007877

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Software paper.

Introduction

Computing architectures are ever-evolving. As these architectures become increasingly complex, we need better software stacks that help us seamlessly port real-world scientific applications to them. It is also important to prepare applications that can be readily retargeted to existing and future systems without drastic code changes while maintaining high performance. However, this is a complex and sometimes impossible task. Programming and optimizing for different architectures often requires, at a minimum, writing code in different programming languages, which means maintaining an entire secondary code base and presents an inherent difficulty for software developers. While a single programming standard would be ideal, it comes with challenges: (1) poorly structured algorithms can hide parallelism from hardware; (2) features in a programming model are often hardware-facing and only occasionally application- or user-facing; and (3) it is hard to design enough levels of abstraction to address all problems under study. Libraries, languages, and directives are three widely accepted software solutions. Libraries suffer from an inherent scope problem: they can only solve a specific subset of problems and are only designed for a specific subset of architectures. Languages are flawed for the reasons outlined above, such as requiring the programmer to rewrite significant amounts of code. Directives are hints given to compilers to create the necessary executables for the underlying platform, and they strive to offer portability without losing performance. OpenMP [1, 2] and OpenACC [3] are two widely popular directive-based models. OpenMP is a shared-memory programming model that has supported heterogeneous computing systems since 2013 (OpenMP 4.0 offloading).
Applications using the offloading model include Pseudo-Spectral Direct Numerical Simulation-Combined Compact Difference (PSDNS-CCD3D) [4] and Quicksilver [5]. OpenACC, ratified in 2011, has since been adopted widely by scientific developers to port their large scientific applications, sometimes production code, to heterogeneous architectures. Some examples include ANSYS [6], GAUSSIAN [7], the nuclear reactor code Minisweep [8], and Icosahedral non-hydrostatic (ICON) [9]. Both OpenMP and OpenACC allow incremental improvement of a given code base and help create reusable code for more than one architecture. This manuscript focuses on the OpenACC model. We use the PGI compiler because of the maturity of its OpenACC implementation. GCC (with OpenACC support developed by Mentor Graphics) also offers an OpenACC implementation; however, at the time of running these experiments it was not yet mature enough.

Overview of the scientific problem: Chemical shift prediction

Nuclear magnetic resonance (NMR) is an experimental technique employed in numerous fields such as chemistry, physics, biochemistry, biophysics and structural biology. A chemical shift, the principal observable in NMR instrumentation, provides valuable insight into protein secondary structure by allowing inferences about conformation to be drawn from peak shifts. Measured in parts per million (ppm), a chemical shift describes the resonant frequency of a nucleus by comparing its observed frequency to that of a standard reference in the presence of a magnetic field. Magnetic resonance imaging, or MRI, is a familiar application of this powerful technology. A central challenge in NMR spectroscopy is the structural determination of large proteins. Biomolecular complexes, such as the protein envelopes, or capsids, which enclose and protect retroviral genomes, often contain symmetries and comprise numerous repeated subunits, creating difficulty in NMR experiments. Solid-state NMR (ssNMR) is a powerful emerging solution, and has successfully elucidated morphological details of multi-million atom complexes such as the HIV-1 protein capsid [10, 11]. With the growing sample sizes accessible to biomolecular experiments, the capability of software and hardware to process and analyze the resulting data is also advancing. A method to calculate a continuum electrostatics model, particularly relevant in computational drug-binding studies, has been applied to a 20 million atom system and demonstrated parallel efficiency of 0.8, requiring less than a minute of wall time with 512 GPUs [12]. In biomolecular applications, this trend of increasing computing power is motivating data-driven solutions to problems such as parameterization of atomic force fields [13], or protein structure determination from electron microscopy data [14]. 
Computational tools to aid structure determination with NMR observables have materialized into a rich domain of protein study and protein chemical shifts have been used in varying ways to successfully elucidate structure. Commonly, these programs employ perusal of scientific databases to establish and parse relationships between shifts, sequence and structure [15-20]. Thanks to projects such as the BioMagResBank (BMRB) [21], NMR data is more available than ever before, engendering the feasibility of semi-empirical prediction methods which utilize existing chemical shift data to parameterize functional prediction models. Obviating the need for database searching and sequence matching is a semi-empirical method named PPM [22]. The goal of PPM is to provide a prediction model that could operate over NMR conformational ensembles, predict chemical shifts from structures and provide new dimensions of protein forcefield refinement, structural refinement, and ensemble validation—a goal which PPM met aptly. In a departure from ensemble analysis, PPM’s successor PPM_One introduced a static-structure based chemical shift prediction method that showed competitive accuracy with other software [23].

Motivation

Drawing from approximations of first-principles calculations and trained with accessible NMR data, the PPM_One model considers chemical shift as a sum of discrete descriptors. These descriptors, which quantify chemical shifts due to ring current effects, hydrogen bond effects, dihedral angles, and more [22, 23], take the form of relatively simple, differentiable functions of the atomic coordinates. These properties make PPM_One a prime target for parallelization and optimization, which would extend practical application of the software to larger structures, populous NMR ensembles, or molecular dynamics trajectories describing thousands of structures. Despite being a suitable candidate, the original PPM_One code was not written to exploit the massive compute power of accelerators such as GPUs. In our work, we have ported the PPM_One application to parallel hardware, such as GPUs, using OpenACC. This work makes the following contributions: (1) equip domain scientists with an accelerated version of PPM_One that functions in a realistic lab environment; (2) provide an accelerated chemical shift prediction code that can be adapted to large Molecular Dynamics packages; and (3) demonstrate the feasibility and scalability of our approach in systems of increasing complexity ranging from 2,000 to 13,000,000 atoms.

Design and implementation

This section discusses how we determined the computationally intensive hotspots, refactored the code, accelerated it using OpenACC, and incrementally improved the application of OpenACC directives.

Identifying computational hotspots in chemical shift prediction

Before accelerating or parallelizing a given code, we profile it with the OpenACC-enabled profiler that comes packaged with the PGI compiler. The tool, PGPROF, displays detailed information about CPU and GPU performance, including breakdowns by runtime, memory management, and accelerator utilization. Fig 1 shows the results of our profile when using a relatively small molecule (100,000 atoms). The profiler was particularly useful since we were unfamiliar with the code at the start of this project: PGPROF quickly allowed us to identify which functions in PPM_One were computationally heavy as well as which functions' runtimes scaled with the dataset size. The different computational functions detected are discussed in detail in the Target functions for acceleration section.

Fig 1

Visual representation of serial profiling data.

(A) The pie chart represents the time taken by the original version of the code. (B) The pie chart represents the time consumed by the different parts of the code after implementing various optimizations.

Initial code refactoring

Many of the functions in the original sequential code were written as direct implementations of their respective algorithms. As a result, they were poorly prepared for accelerators. For example, the code performed redundant memory copies by calling the getselect() function more often than necessary. To fix this, we altered the code to call getselect() only once and then store and reuse the associated memory. This optimization alone led to a 20% performance increase for some of the datasets. The next optimization we made was to a function called clear(), which filters a list of protons, removing any that do not work with the algorithm. The runtime of this function varied greatly (from seconds to hours) depending on the dataset, since some molecular structures require more protons to be filtered than others. We therefore rewrote clear() to use a more efficient list filter that takes only a few seconds or less for all structures. Lastly, we ran into some problems with the C++ STL containers used within the code, mostly with the C++ standard vector class. To account for this, we replaced many C++ vectors with basic arrays, which allowed for more efficient communication with the GPU. In other places, we interfaced with the vector containers by using the built-in data() function to retrieve the underlying memory, allowing us to move the data to the GPU without extra libraries or code rewrites.
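Two of the refactorings described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the Proton type, the usable flag, and both function names are hypothetical, and the erase-remove idiom is only one common way to write an efficient list filter (the filter actually used in the rewritten clear() is not reproduced here). The second function shows the general pattern of passing the pointer returned by vector data() to an OpenACC data clause.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a proton record; PPM_One's real type differs.
struct Proton {
    int id;
    bool usable;  // false if the algorithm cannot handle this proton
};

// Single-pass filter using the erase-remove idiom. Kept elements are shifted
// forward once and the tail is erased in one call, so the cost is linear in
// the list length and avoids repeated erasures from the middle of the vector.
void filter_protons(std::vector<Proton>& protons) {
    protons.erase(
        std::remove_if(protons.begin(), protons.end(),
                       [](const Proton& p) { return !p.usable; }),
        protons.end());
}

// Offload a loop over vector contents by exposing the underlying storage
// through data(); the OpenACC copy clause then works on a raw pointer range.
// Without an OpenACC compiler the pragma is ignored and the loop runs serially.
void scale_coordinates(std::vector<double>& coords, double factor) {
    double* c = coords.data();
    const std::size_t n = coords.size();
    #pragma acc parallel loop copy(c[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        c[i] *= factor;
    }
}
```

Keeping the std::vector and borrowing its storage through data() leaves the host-side code unchanged, which is why it can be attractive when replacing the container with a raw array would touch many call sites.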

Acceleration using OpenACC

OpenACC exposes three levels of parallelism via the gang, worker and vector constructs, which enable programmers to abstract the architecture while making the most of multicore processors or accelerators. Typically, the compute-intensive portions of the program, often identified by profilers, are offloaded to the accelerator. The host orchestrates this task by allocating memory on the accelerator device, initiating data transfer, offloading the code to the accelerator, passing arguments to the compute region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory. With often only minor adjustments to memory management near parallelized compute regions, the model accommodates both shared and discrete memory or any combination of the two across any number of devices. The model exposes the separate memories through the use of a device data environment.

After ensuring that the code was accelerator-compatible, we began applying OpenACC directives. We tackled each function individually in order of importance, starting with get_contact() and finishing with getring(). Every time we made a meaningful alteration, we re-ran the code on a few different datasets and compared the results to their non-accelerated baselines, which let us know if we had introduced any errors along the way. We decorated the major loops in the code with the OpenACC parallel loop directive, which offloads the loops to the GPU. For some loops that were embarrassingly parallel, this alone was enough to see a speedup. In other cases, however, we saw a significant slowdown and sometimes wildly incorrect output compared to our serial baseline. Both problems were overcome by using other OpenACC features.

To fix the incorrect output, we used both the reduction clause and the atomic directive. The reduction clause handles race conditions, which are areas in the code that can produce errors when multiple parallel units overwrite each other in shared memory. The reduction clause prevents this by combining the partial results of the parallel units into a single coherent value. The atomic directive serves a similar purpose, but it is useful in situations where many different race conditions could occur at different locations in memory. There was only one situation in our code where a reduction clause was not sufficient, and that was in the gethbond() function.

The slowdown was caused by too many memory transfers between the host and device. After profiling our initial parallelization of the get_contact() function, we saw that the majority of the time was spent transferring data between host and device memory. Originally, get_contact() was called many times throughout code execution (hundreds to thousands of times, depending on the dataset). We added a loop that iterates over all of the individual get_contact() calls, which gave us another dimension along which to expose parallelism. It also means that no data needs to be transferred between the different calls of get_contact(). This change was highly beneficial: of all the functions, get_contact() received the largest speed-up. The speed-up is discussed in more detail in the Results section.
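The three OpenACC features discussed in this section can be illustrated with a generic sketch. None of the functions below come from PPM_One; their names, arguments, and loop bodies are assumptions chosen only to show where a reduction clause, an atomic directive, and a data region apply. The data region in the last function is a simpler, related way to avoid repeated host-device transfers and is not the call-merging restructuring described above for get_contact().

```cpp
#include <cstddef>
#include <vector>

// Every iteration updates the same scalar, so the loop needs a reduction
// clause: partial sums are accumulated privately and combined at the end.
double sum_contributions(const std::vector<double>& values) {
    const double* v = values.data();
    const std::size_t n = values.size();
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(v[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Scatter-add: different iterations may update the same output slot, and the
// slot is only known at run time, so each update is made atomic instead.
void accumulate_by_target(const std::vector<int>& target,
                          const std::vector<double>& contribution,
                          std::vector<double>& shift) {
    const int* t = target.data();
    const double* c = contribution.data();
    double* s = shift.data();
    const std::size_t n = target.size();
    const std::size_t m = shift.size();
    #pragma acc parallel loop copyin(t[0:n], c[0:n]) copy(s[0:m])
    for (std::size_t i = 0; i < n; ++i) {
        #pragma acc atomic update
        s[t[i]] += c[i];
    }
}

// A data region keeps the array resident on the device across several
// kernels, so it is copied in once and copied out once rather than per call.
void repeated_increment(std::vector<double>& x, int calls) {
    double* p = x.data();
    const std::size_t n = x.size();
    #pragma acc data copy(p[0:n])
    {
        for (int call = 0; call < calls; ++call) {
            #pragma acc parallel loop present(p[0:n])
            for (std::size_t i = 0; i < n; ++i) {
                p[i] += 1.0;
            }
        }
    }
}
```

When built without OpenACC support the pragmas are ignored and the functions run serially with the same results, which mirrors the practice of checking accelerated output against a serial baseline.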

Target functions for acceleration

Each of the functions we have identified is important to the overall chemical shift prediction algorithm that PPM_One implements.

get_contact(), one of the most important functions in the PPM_One algorithm, serves as the principal interface between the input coordinates and secondary structure contact data. get_contact() iterates over all atomic positions given in the molecule and computes a distance between each atom index and the successive atom index. Next, for each atom in each residue in the PPM_One input structure, the random-coil chemical shift for atoms in that residue is applied as a fit parameter to normalize the calculated chemical shift. Since this procedure must be carried out exhaustively over the entire structure and manages data from individual function calls and parameter tables, it takes up a proportionally large piece of the total runtime and can be a huge sequential bottleneck in the program.

gethbond() computes the effect that backbone hydrogen bonding has on chemical shift. PPM_One describes this effect in terms of the inverse of the donor-acceptor distance, and applies a descriptor based on the angles formed by two different atom triples, NHO and HOC′. Since every amino acid has donor-acceptor pairs, this function is called with high frequency and involves distance and angle calculations for each donor and acceptor relative to the specified atom triples, making gethbond() a meaningful target for parallelization and performance gain despite its relatively simple formulation.

The getani() function computes the chemical shift due to magnetic anisotropy. Magnetic anisotropy quantifies the directionally dependent electromagnetic interactions between atoms. PPM_One employs this calculation for interactions between protons and peptide-amide groups consisting of Oxygen (O), Carbon prime (C′) and Nitrogen (N). Additional calls are made to getani() for side-chain OCN groups of Asparagine and Glutamine, OCO side-chain groups of Glutamate and Aspartate, and the NCN side-chain of Arginine. The formulation for the calculation used by PPM_One is known as the “axially symmetric model” [24], in accordance with McConnell’s characterization of anisotropy of peptide groups [25]. At each function call, the distance between the queried proton and the peptide-amide group is calculated. This distance, the vector pointing from the proton to the peptide-amide group, and the vector pointing from the proton to the normal vector of the peptide-amide are used to compute an angle to pass into the magnetic anisotropy expression.

getring() encompasses two different functions in the PPM_One program that calculate the chemical shift due to ring-current effects: one function calculates ring-current effects felt by Hydrogen atoms with respect to an aromatic residue, and the other calculates the effect felt by backbone atoms adjacent to an aromatic residue. PPM_One considers the aromaticity of amino acids Phe, Tyr, His, Trp-5 and Trp-6. The aromatic rings of these residues have important structural implications due to electrostatic induction, as the circular movement of delocalized electrons (i.e., current) in conjugated Pi-bonding orbitals induces a magnetic field vector orthogonal to the plane described by the atoms of the ring. To quantify this effect, the queried atom’s position in Cartesian space must be projected to a position on the 2D subspace defined by the plane of the aromatic ring. Additionally, distances between all atoms in the ring are calculated in this function each time it is called, making it costly to compute even though its application is limited to aromatic residues and atoms in their local environment.
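The geometric building blocks that these functions rely on (interatomic distances, the angle between two vectors, and the projection of an atom onto a ring plane) can be sketched as follows. This is generic vector geometry under assumed names (Vec3, distance, angle_between, project_onto_plane); it does not reproduce PPM_One's descriptor expressions or fitted parameters, and the ring plane is assumed here to be defined by three non-collinear ring atoms.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
}

double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

// Euclidean distance between two atomic positions.
double distance(const Vec3& a, const Vec3& b) {
    const Vec3 d = sub(a, b);
    return std::sqrt(dot(d, d));
}

// Angle in radians between two non-zero vectors.
double angle_between(const Vec3& u, const Vec3& v) {
    return std::acos(dot(u, v) / (std::sqrt(dot(u, u)) * std::sqrt(dot(v, v))));
}

// Orthogonal projection of point p onto the plane through a, b and c
// (assumed non-collinear): subtract the component of (p - a) along the
// unit normal of the plane.
Vec3 project_onto_plane(const Vec3& p, const Vec3& a, const Vec3& b, const Vec3& c) {
    Vec3 n = cross(sub(b, a), sub(c, a));
    const double len = std::sqrt(dot(n, n));
    n = {n[0] / len, n[1] / len, n[2] / len};
    const double d = dot(sub(p, a), n);  // signed distance from the plane
    return {p[0] - d * n[0], p[1] - d * n[1], p[2] - d * n[2]};
}
```

Each helper depends only on a handful of coordinates, which is consistent with the earlier observation that the descriptors are simple functions of atomic positions and therefore map well onto independent loop iterations.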

Results

This section will elaborate on the experimental setup and the results obtained.

Experimental setup

For the multicore, V100 and P40 results shown in both tables, we use the PSG DGX-1b compute node, consisting of a 20-core Intel Xeon E5-2698 v4 CPU and a single NVIDIA Volta V100 card, and another compute node that has a single P40. For the serial runs shown in both tables, since we could not get time on the PSG system, we used a local UDEL system that has an Intel x990 core.

Datasets

Fig 2 shows the different datasets used for our experiments, represented to scale. The first tested dataset constitutes 100,000 atoms, roughly a quarter-turn, of the Dynamin GTPase (structure E) extracted and written to their own Protein Data Bank (PDB) file. Structure B was the HIV-1 capsid assembly (CA) without Hydrogens. This structure was tested without Hydrogens for two reasons: 1) to limit the number of atoms for this test case and 2) to add variety to the set of tested structures. Structures C and D correspond to two variants of the HIV-1 CA, Hydrogens included. Structure C is the HIV-1 CA decorated with Cyclophilin A (CypA); structure D is the same HIV-1 CA decorated with Myxovirus resistance protein B (MxB). These two datasets, 5.1 and 5.9 million atoms respectively, were chosen as test cases of heterogeneous systems in addition to their increased atom counts compared to the undecorated HIV-1 CA. The HIV-1 CA test structures are shown next to their dimeric building block 2KOD (structure A), illustrating the range of scale and complexity of atomistic representations of biomolecules. Finally, the largest two test systems were built from the Dynamin GTPase. Structure E is a 6.8 million atom model, 14 turns, of the GTPase. The largest structure, containing 13.6 million atoms, constitutes 28 turns of the Dynamin GTPase. The secondary structure of 2KOD was calculated using Stride [26]. All images were rendered using VMD 1.9.4 and the co-distributed Tachyon parallel ray-tracing library [27, 28].

Fig 2

Visual rendering of used datasets.

(A) The first tested dataset constitutes 100,000 atoms, roughly a quarter-turn, of the Dynamin GTPase extracted and written to their own Protein Data Bank (PDB) file. (B) Structure B was the HIV-1 capsid assembly (CA) without Hydrogens. (C) Structure C is the HIV-1 CA decorated with Cyclophilin A (CypA). (D) Structure D is the same HIV-1 CA decorated with Myxovirus resistance protein B (MxB). (E) Structure E is a 6.8 million atom model, 14 turns, of the GTPase.

When running the PPM_One application, we noticed that the total runtime is proportional to the number of atoms contained in the molecule. However, this is not the only deciding factor. Between the different-sized molecules, the various compute-intensive functions saw a linear runtime increase with the total number of atoms. The data preprocessing that the code performs, however, can vary greatly based on the molecule, and while we have made many improvements to this step, it is still the bottleneck of the application. Also, the function gethbond() takes a significant amount of time for molecules that contain hydrogen, and almost no time for molecules that do not. To account for these runtime differences, we are mostly concerned with the performance increase of a given molecule across different platforms and less concerned with comparing different molecules to each other.

Table 1 shows a significant decrease in total runtime when comparing the serial (optimized) run to any of the accelerated configurations. For the largest dataset (11.3 million atoms), the multicore run was 18x faster than the optimized single-core run, and the Volta V100 run was 56x faster than single core and 3.1x faster than multicore.

Table 1

Results for small to large dataset.

Configuration	100k atoms	1.5m atoms	5m atoms	6.8m atoms	11.3m atoms
Serial (Unoptimized)	167.11s	572.01s	3547.07s	7 hrs (estimate)	14 hrs (estimate)
Serial (Optimized)	53.57s	196.12s	2003.6s	1510.71s	2614.4s
Multicore	4.67s	32.82s	116.66s	153.8s	146.06s
P40	3.47s	17.15s	56.2s	78.57s	72.55s
V100	3.11s	13.62s	39.79s	49.63s	46.71s

For these results, an Intel Xeon e5-2698 v4 20 cores CPU and a NVIDIA Volta V100 GPU were used.

When observing individual function performance, we see more significant speedup numbers, as shown in Table 2. Comparing V100 results to the multicore results, the get_contact() function was sped up by 258x, gethbond() by 11x, getani() by 10x and getring() by 3x. Such high speedups are common for functions that are purely compute-intensive and hence can be easily optimized for GPUs. Since our major computational functions see speedups of this size, we expect that much of the remaining total runtime is bound by other portions of the code such as file I/O or preprocessing. We have improved these parts of the code significantly since the start of this project (as seen when comparing the serial unoptimized numbers against the serial optimized ones). We do not believe that much more could be done to improve these aspects without rewriting large portions of the code.

Table 2

Runtime for medium dataset by function.

Configuration (5m atoms)	Total Runtime	get_contact	getani	getring	gethbond
Serial (Optimized)	2003.60s	1177.61s	58.95s	22.53s	708.07s
Multicore	116.66s	51.73s	2.4s	0.6s	25.39s
P40	56.2s	1.69s	1.06s	0.5s	17.05s
V100	39.79s	0.2s	0.24s	0.18s	2.35s

Validation of results: Calculation RMSE

To calculate the Root Mean Square Error (RMSE), we ran the unaltered code on a single core of a single CPU on 299 different PDB files. Then we reran each file with the developed OpenACC code on the same CPU core, but now with GPU offloading. The following numbers shown in Table 3 are collected by using the RMSE formula on every prediction of every file comparing the CPU and GPU output.

Table 3

RMSE difference between CPU and GPU code.

Metric	C_a	C_b	C	HN	N	H_a
RMS error (ppm)	1.58e-4	8.48e-5	1.97e-4	5.22e-5	2.84e-4	1.02e-4
Max error (ppm)	0.013	0.008	0.017	0.007	0.025	0.013
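The two statistics reported in Table 3 can be computed as in the following sketch. It assumes the CPU and GPU predictions for one atom type have already been collected into two equally ordered arrays; the function and struct names are hypothetical, and how the authors' tooling gathers predictions across the 299 files is not shown in the source.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct ErrorStats {
    double rmse;     // root mean square difference, in ppm
    double max_abs;  // largest absolute difference, in ppm
};

// Compare paired CPU and GPU predictions element by element.
// If the lengths differ, only the common prefix is compared.
ErrorStats compare_predictions(const std::vector<double>& cpu,
                               const std::vector<double>& gpu) {
    const std::size_t n = std::min(cpu.size(), gpu.size());
    double sum_sq = 0.0;
    double max_abs = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = cpu[i] - gpu[i];
        sum_sq += d * d;
        max_abs = std::max(max_abs, std::fabs(d));
    }
    const double rmse = (n > 0) ? std::sqrt(sum_sq / static_cast<double>(n)) : 0.0;
    return {rmse, max_abs};
}
```

Reporting the maximum alongside the RMSE is useful because a small RMSE alone could hide a few large outliers between the two code paths.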

Next, we wanted to assess the prediction accuracy of the PPM_One code against experimentally derived chemical shifts. PPM_One reported root-mean-square prediction error for a set of validation structures [23]: 0.9 ppm for Carbon alpha and 1.0 ppm for Carbon beta atoms, 1.41 ppm for carboxyl Carbon atoms, 0.24 ppm for Hydrogen alpha and 0.43 ppm for amide Hydrogen atoms, and 2.31 ppm for Nitrogen atoms. To compare the accuracy of GPU-accelerated PPM_One with respect to experimental chemical shifts, chemical shifts were predicted for three structures which were not part of the PPM_One training or validation sets [29-31]. We found root-mean-square prediction errors comparable to those reported for PPM_One [23]: 1.12 ppm for Carbon alpha and 1.11 ppm for Carbon beta atoms, 1.03 ppm for carboxyl Carbon atoms, 0.55 ppm for Hydrogen alpha and 0.71 ppm for amide Hydrogen atoms, and 1.41 ppm for Nitrogen atoms. Together with the RMSE analysis between CPU and GPU versions of the code, we conclude that PPM_One provides robust and accurate chemical shift predictions which were unaffected by our GPU acceleration efforts.

Availability and future directions

The PDB files have been previously published and can be found in [32-34]. The code used for this manuscript is available in our GitHub repository at https://github.com/UD-CRPL/ppm_one. Efficiently predicting chemical shifts is an important utility for many potential MD applications. With our GPU acceleration, we believe that PPM_One can now be used for predicting chemical shifts of large molecular structures. As future work, for problems of greater magnitude than those we have studied, we will update the software to use MPI with OpenACC and scale across multiple nodes.

10 Mar 2020

Dear Dr. Chandrasekaran,

Thank you very much for submitting your manuscript "Accelerating Prediction of Chemical Shift of Protein Structures on GPUs" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments.
If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,
Dina Schneidman-Duhovny
Software Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: This work presents a systematic report on accelerating the PPM_One program by implementing portable optimization using OpenACC. It includes a detailed description of the process that helped improve the efficiency of PPM_One by orders of magnitude. It will greatly benefit the NMR community and structural biology work involving NMR methods. The manuscript is nearly ready for publication, in my opinion. Here are some minor comments:

1. Page 2-3, Section "Overview of the Scientific Problems: Chemical shift Prediction": The citations in the first three paragraphs do not show up correctly. Please correct them.

2. Page 6, the end of paragraph 4: "The speed-up will be discussed in more detail in Section 3.2." However, sections are not numbered, so it is unclear which section this refers to.

3. Page 7, first sentence of paragraph 2: "The function getani() represents the compute region for calculating the chemical shift due to magnetic anisotropy." This sentence may need revision.

4. Table 1. The times for 6.8m atoms are longer than for 5m atoms, which is easy to understand. However, why are they longer than for 11.3m atoms? Any particular reason? Would love to see the explanation. Or maybe this reveals there is something else that needs optimization?

Reviewer #2: The manuscript at a high level is very well organized and written. This work will definitely help other MD application developers think of using or adapting to programming models such as OpenACC to accelerate their applications, without having to delve into the nitty-gritty's of CUDA. Minor revisions are needed in the abstract and throughout the paper; comments are indicated in the pdf (please open using Adobe Reader or similar). Comments:

1. Be explicit in the abstract that major refactoring on the CPU side of the code happened as well before accelerating the code on the GPUs. Just 14hrs to 46secs seems like the code was really really bad, and the comparison is not appropriate. Recommend tweaking the statement a bit to reflect the changes (note: this was clear in the results section but not in the abstract).

2. Some citations are missing, Fig 1 was missing (I think at least, not in the pdf that was uploaded), also recommend running the manuscript through Grammarly (or similar) for grammatical error fixes.

3. Are there plans for running such simulations / modeling on larger systems with a higher (and more complex) node count, and what do the authors see as a challenge when running similar workloads with OpenACC + MPI (or OpenMPI)? Maybe something along these lines in a future work or ongoing work section would be an appropriate addition to this manuscript.

4. Recommend outlining the challenges of accelerating and porting such an application using OpenACC in a subsection; this will help other MD app developers to take that leap.

Reviewer #3: The manuscript describes a significant improvement on the run time of a chemical shift prediction program.
Chemical shift prediction is an important area of ongoing research to leverage the large amount of data available from NMR experiments of protein. The manuscript does a very good job of addressing the case of solid-state NMR, which can tackle large protein assemblies. The test models were chosen well and demonstrate the challenge associated with chemical shift prediction. The results show important acceleration gains achieved, and the RMSD comparison clearly indicates a successful implementation. As such this is a significant contribution to the field. One minor suggestion is that the authors include a brief comment on how successful PPM_One is when compared to experimental NMR results (if available) for the various chosen models. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. 
Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see Submitted filename: ppm_plos_chandrasekaran.pdf Click here for additional data file. 8 Apr 2020 Submitted filename: Author_Responses_PCOMPBIOL-D-20-00060.pdf Click here for additional data file. 15 Apr 2020 Dear Dr. Chandrasekaran, We are pleased to inform you that your manuscript 'Accelerating Prediction of Chemical Shift of Protein Structures on GPUs: Using OpenACC' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. 
Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Dina Schneidman-Duhovny Software Editor PLOS Computational Biology *********************************************************** 1 May 2020 PCOMPBIOL-D-20-00060R1 Accelerating Prediction of Chemical Shift of Protein Structures on GPUs: Using OpenACC Dear Dr Chandrasekaran, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! 
With kind regards, Sarah Hammond PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
