Sequential search leads to faster, more efficient fragment-based de novo protein structure prediction.

Saulo H P de Oliveira¹, Eleanor C Law¹, Jiye Shi^2,3, Charlotte M Deane¹.

Abstract

Motivation: Most current de novo structure prediction methods randomly sample protein conformations and thus require large amounts of computational resource. Here, we consider a sequential sampling strategy, building on ideas from recent experimental work which shows that many proteins fold cotranslationally.
Results: We have investigated whether a pseudo-greedy search approach, which begins sequentially from one of the termini, can improve the performance and accuracy of de novo protein structure prediction. We observed that our sequential approach converges when fewer than 20 000 decoys have been produced, fewer than commonly expected. Using our software, SAINT2, we also compared the run time and quality of models produced in a sequential fashion against a standard, non-sequential approach. Sequential prediction produces an individual decoy 1.5-2.5 times faster than non-sequential prediction. When considering the quality of the best model, sequential prediction led to a better model being produced for 31 out of 41 soluble protein validation cases and for 18 out of 24 transmembrane protein cases. Correct models (TM-Score > 0.5) were produced for 29 of these cases by the sequential mode and for only 22 by the non-sequential mode. Our comparison reveals that a sequential search strategy can be used to drastically reduce computational time of de novo protein structure prediction and improve accuracy.
Availability and implementation: Data are available for download from: http://opig.stats.ox.ac.uk/resources. SAINT2 is available for download from: https://github.com/sauloho/SAINT2.
Contact: saulo.deoliveira@dtc.ox.ac.uk.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID： 29136098 PMCID： PMC6030820 DOI： 10.1093/bioinformatics/btx722

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

A standard de novo protein structure prediction pipeline consists of randomly sampling the conformational space to identify minimum-energy conformations. This sampling is usually carried out via a Monte-Carlo search (Raman ); by causing perturbations to a fully elongated protein chain and accepting/rejecting the resulting conformations based on an acceptance probability. This probability is defined in terms of a scoring function that combines physical and statistical terms. After many successive perturbations, a conformation is output. The model generation protocol is repeated via multiple independent runs to produce a large number of candidate models (decoys). This process tends to be computationally intensive; one estimate suggests that it takes approximately 150 CPU days to accurately predict a protein’s structure (Abbass and Nebel, 2015). There has been significant effort to test different sampling strategies to improve both the efficiency and the performance of de novo protein structure prediction. Replica Exchange Monte Carlo has been used as an extension to the traditional Monte Carlo protocol in different implementations (Blaszczyk ; Kosciolek and Jones, 2014; Xu and Zhang, 2012) and it has been suggested as a more efficient sampling technique. Evolutionary algorithms have also been applied to structure prediction in order to detect multiple candidate energy minima conformations (e.g. Custodio ; Garza-Fabre ; Zhang ). Other search strategies include the optimization of a multi-objective function (Olson and Shehu, 2014), or approaches based on molecular dynamics (Perez ). Deviating from the random-restart strategy used in conventional protocols, search algorithms have also been implemented to extract information from decoys that have been produced to improve subsequent modeling runs (re-sampling) (e.g. Brunette and Brock, 2008; Mabrouk ; Shrestha and Zhang, 2014). 
In several of its implementations, re-sampling has been shown to improve the results of the Monte Carlo search implemented in ROSETTA (Simoncini ; 2017). Another perspective is to explore probabilistic frameworks such as Hidden Markov Model sub-optimal sampling (Lamiable ) and conditional sampling from a united-residue probabilistic model (Bhattacharya ). The latter was based on experimental evidence supporting the notion of foldon units. These probabilistic frameworks aim to break down the problem of folding into smaller local folding problems. Here, we propose a similar reductionist effect by performing the Monte Carlo search in a sequential fashion, reducing the global folding problem to a more tractable, local conformational search.

Sequential search strategies have been previously explored. A modified version of ROSETTA (Raman ) was used to perform a comparison between predictions generated using a fully elongated protein chain and predictions performed sequentially (Ellis ). Predictions performed sequentially using the modified ROSETTA were shown to be better than predictions generated non-sequentially for approximately half of the cases. Sequential protein structure prediction has also been used in the ab initio transmembrane protein protocol of ROSETTA (Yarov-Yarovoy ). The strategy starts with a helix in the middle of the protein, then adds further transmembrane helices randomly at either the C- or N-terminal end.

Regardless of the search strategy used to sample conformations, modeling success is highly dependent on the accuracy of the scoring function. Improvement of existing scoring potentials has been the focus of several articles published recently (Chae ; Ovchinnikov and 2015b; O’Meara ; Yang and Zhang, 2015). In particular, pairwise potentials based on distance restraints inferred from co-evolution information have made consistent and accurate template-free structure prediction possible (e.g. de Oliveira ; Jones ; Kamisetty ; Marks ), when a sufficient number of homologue sequences is available. Metagenomics from microbial DNA has been used to complement this sequence information, further broadening the applicability of such approaches (Ovchinnikov ). Contact predictions have been shown to be critical for modeling success. As co-evolution methods have only recently become a standard part of de novo protein structure prediction, most search strategies (including the sequential implementations of ROSETTA) have not incorporated these distance restraints in their tests. One exception is the work described in (Jones ), in which sequential incorporation of distance restraints led to better modeling results when the contact order was sufficiently high.

One of the main limitations that has not been addressed by contact prediction is the fact that de novo structure predictors still require a large amount of computational resources for accurate and consistent modeling. This relates to the large number of decoys that need to be produced during model generation and to the large number of moves performed to generate a single decoy. There is no consistency in terms of the number of decoys that need to be produced across different prediction software (Supplementary Table S1). Three recent studies using the software ROSETTA describe the use of 10 000 (Ovchinnikov ), 20 000 (Ovchinnikov ), and 20 000–900 000 (Kim ) decoys per target, meaning that the consensus is not clear even for the same structure predictor. Furthermore, no rationale as to how many decoys should be produced is presented in articles describing different methods, and for some cases the choice appears arbitrary.

Here, we investigate whether a sequential search heuristic could be used to improve both the efficiency and the accuracy of template-free protein structure prediction. To do so, we developed SAINT2, a completely independent fragment-assembly structure predictor. 
SAINT2 differs from conventional fragment-assembly approaches as it is able to perform predictions either sequentially, starting from either terminus, or non-sequentially, similar to traditional structure prediction software such as ROSETTA. Both sequential and non-sequential protocols use exactly the same parameters and input to facilitate unbiased comparison between the two modes. Given that successful de novo modeling is reliant on accurate contact-prediction, SAINT2 incorporates predicted protein contacts into its modeling routine. First, we present a rationale for the number of decoys that SAINT2 needs to generate in order for a correct answer to be produced. We then compare the run time and the modeling results of SAINT2’s sequential and non-sequential approaches on validation sets of 41 soluble proteins and 24 transmembrane proteins. Our results show that sequential protein structure prediction requires fewer decoys to be produced, produces individual decoys significantly faster and is capable of consistently generating better models.

2 Materials and methods

We have implemented a sequence-to-structure pipeline to perform de novo protein structure prediction (see Supplementary Fig. S1). Our pipeline takes as input a target sequence for which we generate secondary structure predictions using PSIPRED (Jones, 1999), torsion angle predictions using SPINE-X (Faraggi ; 2012), a fragment library using Flib (de Oliveira ), and, when possible, residue–residue contact predictions using metaPSICOV (Jones ) as it was shown to produce the most accurate predictions (de Oliveira ) (for full details see Supplementary Material). The final step in our pipeline is to generate structure predictions using SAINT2. SAINT2 requires the output files of steps one to four to generate models.

2.1 Fragment library

Flib (de Oliveira ) is used to generate the fragment libraries for SAINT2. Flib extracts fragments from a curated database of known structures. This database is a non-redundant (sequence identity ), high quality (resolution < 2.5 Å) subset of the PDB (Berman ). HHSearch (Söding, 2005) is used to identify and remove homologs to the target from this database in order to represent a realistic de novo structure prediction scenario. Fragments are selected from structures in this database based on the target’s sequence, predicted secondary structure and predicted torsion angles. On average, Flib generates ∼30 fragments per target position that are 6–20 residues long. The same fragment library is used for a target in all three modes of SAINT2.

2.2 SAINT2

SAINT2 is based on a heuristic that treats protein structure prediction as a global optimization problem. Its energy function is a combination of different knowledge-based and physical potentials (see Supplementary Material for more details). The conformational space is sampled using a library of fragments of known structures, performing successive fragment replacements on an existing peptide conformation (Supplementary Fig. S2). SAINT2 builds models for a given target using the 3-D Cartesian coordinates of the five main backbone atoms (C, N, O, C-α and C-β). These coordinates are calculated using the fragment library (see Supplementary Material for more details) and completed using ideal bond lengths. SAINT2 does not consider side-chains explicitly. Three different modes have been implemented within SAINT2: Forward, Non-sequential and Reverse. SAINT2 Forward is initialized by selecting a fragment from the fragment library corresponding to the N-terminal residues of the target protein. In this mode, the peptide will grow as the simulation is executed. The direction of peptide extrusion is N-terminal to C-terminal. The reverse mode is analogous to the forward mode, but the initialization occurs at the C-terminus. In the reverse mode, the peptide will also grow as the simulation is executed, but the direction of peptide extrusion is reversed (C-terminal to N-terminal). SAINT2 can also perform fragment-assembly in a similar fashion to traditional approaches such as ROSETTA. We refer to this as non-sequential structure prediction. In the non-sequential mode, SAINT2 is initialized with a fully extruded protein conformation where the torsion angles are set to 180° and ideal bond lengths and angles are used. In the analyses described in this manuscript, an identical number of moves is used for each of the three modes of SAINT2.

2.3 Model generation

The Forward mode of SAINT2 is outlined in Supplementary Figure S2. Here, we outline each of the stages of our model generation routine.

2.3.1 Fragment replacement step

Fragments are selected at random from the fragment library. The probability of selecting a given fragment is proportional to the fragment score assigned by Flib, which is based on the predicted torsion angle score (Supplementary Fig. S2).
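
Score-proportional selection of this kind can be illustrated with a short sketch. This is not SAINT2 code: the function name `pick_fragment`, the representation of the library as `(fragment_id, score)` pairs, and the requirement that scores are positive are all assumptions made for illustration.

```python
import random


def pick_fragment(fragments, rng=random):
    """Pick one fragment with probability proportional to its score.

    `fragments` is a list of (fragment_id, score) pairs. The pair layout
    and the positivity of the scores are illustrative assumptions; the
    real Flib output format is not described in this section.
    """
    ids = [frag_id for frag_id, _ in fragments]
    weights = [score for _, score in fragments]
    if any(w <= 0 for w in weights):
        raise ValueError("scores must be positive to act as sampling weights")
    # random.choices draws with replacement using relative weights.
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    library = [("frag_a", 1.0), ("frag_b", 3.0)]
    rng = random.Random(0)
    draws = [pick_fragment(library, rng) for _ in range(10_000)]
    # frag_b carries three times the weight, so it should appear in
    # roughly three quarters of the draws.
    print(draws.count("frag_b") / len(draws))
```

With two fragments scored 1.0 and 3.0, the second is expected to be drawn about three times as often as the first.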

2.3.2 Extrusion steps

Extrusion steps are a specific type of fragment replacement that always takes place at the end of the existing conformation, growing the peptide by one residue. For the forward mode, an extrusion always occurs at the C-terminal end of the peptide, in which a fragment representing the C-terminal is randomly selected from the fragment library (see Supplementary Fig. S2). The extrusion replacement always adds a new residue to the existing peptide conformation. For the Reverse mode, extrusion occurs in an analogous fashion, but at the N-terminal end of the peptide. The new conformation resulting from an extrusion step is always accepted. No extrusions are performed for the non-sequential mode, as in this mode the initial conformation already contains all the residues of the target. Different increments of up to 10 residues were tested for extrusion steps and produced comparable results. We have chosen to use an increment size of one residue. Extrusion steps use a different fragment as opposed to extending the existing fragment by one residue. However, this choice does not affect the results as extrusion steps are always accepted and a significant number of move steps is performed between extrusions.

2.3.3 Move steps

Move steps are fragment replacements that take place at random positions in the existing peptide conformation. Unlike extrusion steps, move steps do not append new residues at the end of the sequence.

2.3.4 The Mover

The mover is responsible for swapping between move and extrusion steps in the Forward and Reverse modes of SAINT2 (for more details see Supplementary Material).
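
The interplay of extrusion and move steps described above can be sketched as a small loop. This is a toy illustration under several assumptions, not the SAINT2 mover, whose details are in the Supplementary Material: the chain is a list of numbers rather than backbone coordinates, lower scores are taken to be better, moves are accepted with a Metropolis criterion at a fixed temperature, and the move budget is split evenly between extrusions. All names (`build_decoy`, `toy_score`, `toy_new_residue`, `toy_perturb`) are hypothetical.

```python
import math
import random


def build_decoy(n_residues, n_moves, score, new_residue, perturb,
                start_len=1, temperature=1.0, rng=None):
    """Grow a chain residue by residue, running move steps in between.

    Setting start_len == n_residues mimics the non-sequential mode:
    no extrusions happen and all moves act on the full chain.
    """
    rng = rng or random.Random()
    chain = [new_residue(rng) for _ in range(start_len)]
    stages = n_residues - start_len + 1      # one stage per chain length
    moves_per_stage = n_moves // stages      # even split (assumption)
    current = score(chain)
    while True:
        for _ in range(moves_per_stage):
            candidate = perturb(chain, rng)
            cand_score = score(candidate)
            delta = cand_score - current
            # Metropolis acceptance (assumption): always accept
            # improvements, sometimes accept worse conformations.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                chain, current = candidate, cand_score
        if len(chain) == n_residues:
            break
        # Extrusion step: append one residue at the growing end.
        # As in the text, an extrusion is always accepted.
        chain = chain + [new_residue(rng)]
        current = score(chain)
    return chain, current


# Toy stand-ins for a scoring function and fragment replacements.
def toy_score(chain):
    return sum(x * x for x in chain)


def toy_new_residue(rng):
    return rng.uniform(-1.0, 1.0)


def toy_perturb(chain, rng):
    new = list(chain)
    new[rng.randrange(len(new))] = toy_new_residue(rng)
    return new


if __name__ == "__main__":
    chain, final = build_decoy(20, 2000, toy_score, toy_new_residue,
                               toy_perturb, rng=random.Random(1))
    print(len(chain), round(final, 3))
```

Because both modes share one function and differ only in `start_len`, the sketch mirrors the design choice stated in the text: the same move budget and scoring are used whether the chain is grown or fully present from the start.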

2.3.5 Score

SAINT2 uses a combined knowledge-based and physical potential that consists of five different components: RAPDF, Lennard-Jones, solvation, predicted secondary structure, and predicted inter-residue contacts. The score is a weighted sum of each of its five components (refer to Supplementary Material for more details).
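
As a minimal sketch of a weighted-sum score over the five components named above: the dictionary keys, the function name `combined_score` and the example weights below are hypothetical, since the actual weights and term definitions are given only in the Supplementary Material.

```python
# Component names follow the five terms listed in the text; the exact
# identifiers are illustrative, not SAINT2's own.
COMPONENTS = (
    "rapdf",
    "lennard_jones",
    "solvation",
    "secondary_structure",
    "contacts",
)


def combined_score(terms, weights):
    """Return the weighted sum of the five score components.

    `terms` and `weights` map each component name to a float.
    """
    missing = [c for c in COMPONENTS if c not in terms or c not in weights]
    if missing:
        raise KeyError(f"missing components: {missing}")
    return sum(weights[c] * terms[c] for c in COMPONENTS)


if __name__ == "__main__":
    # Placeholder values, not fitted weights or real energies.
    terms = dict(zip(COMPONENTS, (1.0, 2.0, 3.0, 4.0, 5.0)))
    weights = dict.fromkeys(COMPONENTS, 0.5)
    print(combined_score(terms, weights))
```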

2.3.6 Decoy selection

SAINT2 samples the conformational space by generating thousands of decoys for each target (see Determining the Number of Decoys section). Decoys are ranked according to our combined knowledge-based and physical potential (see Score section of Supplementary Material).

2.4 Datasets

SAINT2 was trained using a set of 43 structurally diverse proteins extracted from the PDB (Berman ). A full list of these proteins is given in Supplementary Table S1. These proteins are all single chain, single domain proteins proportionally distributed among the four SCOP (Murzin ) protein classes: all α, all β, α/β, and α+β. They are also evenly spread in terms of length, ranging from 59 to 508 residues. Each of the proteins in our dataset belongs to a different Pfam family (Punta ). Our sequential comparison analyses were carried out on two validation datasets: a soluble set of 41 structurally diverse proteins and a transmembrane set of 24 α-helical bundles extracted from the PDB (Berman ). A full list of these proteins is given in Supplementary Table S2. For the soluble set, analogous to the training dataset, the proteins are all single chain, single domain proteins proportionally distributed into the four SCOP (Murzin ) protein classes. They are also evenly spread in terms of length, ranging from 54 to 504 residues. For the transmembrane set, a set of polytopic α-helical transmembrane chains was taken from the Orientations of Proteins in Membranes (OPM) database (Lomize ). Taking only unbroken chains, the set was culled to keep no more than one member of each family, as defined by the OPM database, and also culled by PISCES (Wang and Dunbrack, 2003) so that the maximum sequence identity between chains was 20%. By manual inspection, chains were selected which had no soluble domain, consisted of at least four helices, and formed a single transmembrane domain. We used only the 24 shortest proteins in the resulting set, ranging from 132 to 385 residues in length. There is no overlap between the Pfam families in the training and validation sets.

2.5 CASP12 dataset

We used SAINT2 to generate models for 23 free-modeling domains from CASP12. For this comparison, we considered only free-modeling targets and included all the domains for which structural data was available on the CASP12 website. Only sequence and structure information available before the beginning of CASP12 was used in this analysis (we excluded all structures and sequences published after April 2016 from our databases).

2.6 Validation

TM-Score (Xu and Zhang, 2010; Zhang and Skolnick, 2004) was used to evaluate the quality of the decoys generated by SAINT2. Three different TM-Score based measures were defined to help assess our results: TM-Score of the best decoy (TM-Score Best): computed by selecting the decoy from among all decoys generated with the highest TM-Score compared to the target’s native structure. TM-Score of the Top 5 decoys (TM-Score Top-5): the TM-Score of the best decoy among the five top decoys output by our sequence-to-structure pipeline.
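
The two measures spelled out above can be computed from a list of decoys as in the following sketch. It assumes that each decoy already has a TM-Score to the native structure (computed by an external tool) and a pipeline score, and that a lower pipeline score ranks higher; the ranking direction and the function names are assumptions, not taken from the paper.

```python
def tm_score_best(decoys):
    """Highest TM-Score among all decoys.

    `decoys` is a list of (pipeline_score, tm_score) pairs.
    """
    return max(tm for _, tm in decoys)


def tm_score_top5(decoys, k=5):
    """Highest TM-Score among the k top-ranked decoys.

    Ranking assumes that a lower pipeline score is better.
    """
    ranked = sorted(decoys, key=lambda d: d[0])
    return max(tm for _, tm in ranked[:k])


if __name__ == "__main__":
    decoys = [(-10.0, 0.30), (-9.0, 0.40), (-8.0, 0.20),
              (-7.0, 0.35), (-6.0, 0.10), (-5.0, 0.90)]
    print(tm_score_best(decoys))   # best over the whole ensemble
    print(tm_score_top5(decoys))   # best among the five top-ranked
```

In the toy data the most native-like decoy is ranked sixth, so TM-Score Best and TM-Score Top-5 differ; this is the gap that a model selection protocol would aim to close.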

3 Results

3.1 Determining the number of decoys required by SAINT2

Successful de novo protein structure prediction methods tend to rely on brute force approaches that generate hundreds of thousands of conformations (Kandathil ; Moult ). Therefore, accurate template-free modeling is heavily dependent on the availability of large computational resources. As seen in Supplementary Table S1, there is no consensus as to the number of decoys that need to be produced across different methods or even for a single method. It is hard to draw a comparison across different predictors as some perform significantly higher numbers of moves for a single decoy and produce a smaller number of decoys. Little analysis has been done to assess how many decoys are actually needed in order to obtain a good answer. The only common result is the suggestion that the longer the protein, the larger the number of decoys needed (Kim ; Moult ; Simoncini and Zhang, 2013; Xu and Zhang, 2012). Given the recent improvements obtained by incorporation of co-evolution restraints into prediction pipelines, it is possible that more efficient search heuristics could be used to reduce the number of decoys used. SAINT2 Forward, which performs prediction sequentially starting with a fragment representing the N-terminus and gradually growing this peptide as the conformational space is sampled, was used to generate 100 000 decoys for each of the 43 proteins in our training dataset. Correct answers [TM-Score to native structure > 0.5 (Xu and Zhang, 2010)] were generated for 25 targets. These 25 cases were used to estimate how many decoys are necessary to obtain a correct answer and a ‘best’ model in the 100 000 decoy ensemble (Fig. 1). We define a ‘best’ model as a decoy within 0.05 TM-Score units of the best possible solution in the 100 000 ensemble. In order to identify the number of decoys required to produce either a ‘best’ model or a correct answer, we sampled decoys from the ensemble. 
A hundred samples of each size were taken and we noted the sample size (number of decoys) needed to observe at least one correct answer (Supplementary Fig. S3) or ‘best’ model in over 95% of samples of a given size.
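
The subsampling procedure just described can be sketched as follows. The function name `min_sample_size`, the grid of candidate sizes and the choice to sample without replacement are assumptions; the text only specifies a hundred samples per size and the 95% criterion.

```python
import random


def min_sample_size(tm_scores, sizes, is_hit, n_samples=100,
                    required_fraction=0.95, rng=None):
    """Smallest sample size at which a hit is seen in over 95% of samples.

    tm_scores : TM-Scores of every decoy in the full ensemble
    sizes     : candidate sample sizes to try (illustrative grid)
    is_hit    : predicate on a TM-Score, e.g. tm > 0.5 for a correct answer
    Returns None if no candidate size meets the criterion.
    """
    rng = rng or random.Random()
    for size in sorted(sizes):
        successes = 0
        for _ in range(n_samples):
            # Sampling without replacement is an assumption here.
            sample = rng.sample(tm_scores, size)
            if any(is_hit(tm) for tm in sample):
                successes += 1
        if successes / n_samples > required_fraction:
            return size
    return None


if __name__ == "__main__":
    # Toy ensemble: 1% of decoys are correct (TM-Score > 0.5).
    ensemble = [0.6] * 10 + [0.3] * 990
    needed = min_sample_size(ensemble, [10, 100, 500, 1000],
                             is_hit=lambda tm: tm > 0.5,
                             rng=random.Random(0))
    print(needed)
```

For the ‘best’ model criterion, the same function can be used with a predicate such as `lambda tm: tm >= max(ensemble) - 0.05`.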

Fig. 1.

Number of decoys required by SAINT2 to produce a correct answer or a ‘best’ model. We generated 100 000 decoys for each target in the training dataset and have estimated both the number of decoys required to produce a correct answer (A) and to produce a ‘best’ model (B). A correct answer is one with TM-Score to the native structure greater than 0.5 and a ‘best’ model is a decoy within 0.05 TM-Score units of the best possible solution produced in the 100 000 decoy ensemble (see Supplementary Fig. S3 for more details). Proteins are coloured according to their SCOP classes.

Our results show that when fewer than 10 000 decoys are generated, SAINT2 consistently produces a correct answer for 20 targets and a ‘best model’ (as good as any in the 100 000 ensemble) for 14 targets (Fig. 1). We analysed whether sequence features could be used to estimate the number of decoys required to consistently produce a correct answer (Fig. 2). For the proteins shorter than 250 residues where SAINT2 has produced a correct answer, this answer can consistently be produced when fewer than 10 000 decoys have been generated. For proteins longer than 250 residues, a larger number of decoys had to be output to achieve this consistency. Other than this binary behavior, no correlation was observed between length and the calculated required number of decoys. When considering the number of loop positions, similar results were observed. Results for the SCOP classes show that SAINT2 can consistently generate a correct answer for most All α and All β proteins in our dataset when fewer than 5000 decoys are produced. However, no correct answer was produced for any All β proteins longer than 250 residues.

Fig. 2.

Correlation between the number of decoys required by SAINT2 to produce a correct answer and three sequence-based features. The x-axis represents a feature, protein length (A), number of predicted loop positions (B), and SCOP class (C). The y-axis is the number of decoys required to generate a correct answer for the 25 targets in our training dataset where SAINT2 produced a correct answer.

Sequence features were also compared against the number of decoys required to consistently produce a ‘best’ model for each protein in our training dataset (Supplementary Fig. S4). We observe no correlation between our estimate for the number of decoys required to produce a ‘best’ model and protein length or number of loop positions. Results for the SCOP classes show that α + β proteins tend to require fewer decoys than proteins from other SCOP classes in order for a ‘best’ model to be generated. We also tested for a relationship between the number of decoys required to produce a correct answer or a ‘best’ model with the fragment library precision and the precision of predicted residue-residue contacts and observed no correlations (Supplementary Figs S5–8). We have assessed the number of decoys that need to be generated by SAINT2 so that a correct answer or a ‘best’ model is produced. Our results reveal that generating 10 000 decoys is sufficient to ensure that a correct answer is produced for most cases. We have also shown that the number of decoys required to produce a ‘best’ model shows little correlation to protein length and is lower for proteins belonging to α + β SCOP classes. These results allow us to estimate the number of decoys that SAINT2 should generate for a given target.

3.2 Impact of sequentiality on the quality of soluble protein models

The main concern when employing a pseudo-greedy search heuristic is local minimum entrapment, especially when considering a rugged objective function. We have, therefore, investigated the impact of sequentiality on the quality of models. We performed a comparison between SAINT2 Forward and SAINT2 Non-sequential to evaluate performance on a validation set of 41 soluble proteins. We generated 10 000 decoys for each of the 41 targets in our soluble validation set using SAINT2 Forward and SAINT2 Non-sequential. For this comparison, identical fragment libraries, contact predictions and number of moves to generate a decoy were used. In that sense, both approaches were nearly identical. No aspect of our pipeline was designed to favor sequential prediction over its non-sequential counterpart. In fact, the knowledge-based potentials in SAINT2 were developed for a non-sequential prediction pipeline. Decoys generated using SAINT2 Forward tend to present a higher TM-Score Best when compared to SAINT2 Non-sequential (Fig. 3—left). SAINT2 Forward produced a correct answer (Best TM-Score > 0.5) for 18 cases whereas SAINT2 Non-sequential produced a correct answer for only 13 cases. There are no cases where a correct answer was generated using the SAINT2 Non-sequential that was not also produced sequentially. SAINT2 Forward predictions were better in 10 out of the 13 cases for which SAINT2 Non-sequential generated a correct answer.

Fig. 3.

Comparison of the TM-Score Best for a validation set of 41 soluble proteins (left) and 24 transmembrane proteins (right) obtained using SAINT2 Forward (x-axis) against SAINT2 Non-sequential (y-axis). Points below the diagonal indicate cases where sequential prediction performs better than non-sequential prediction. Point size indicates protein length and point colour indicates the protein SCOP class.

These trends are reproduced if we look at the TM-Score Top-5 (Supplementary Fig. S9). SAINT2 Forward presents a higher TM-Score Top-5 than SAINT2 Non-sequential in 33 of the 41 cases, generating 11 correct answers (TM-Score Top-5 > 0.5). SAINT2 Non-sequential produced a correct answer amongst its top five scoring decoys in only 3 of the 41 cases. It has previously been suggested that proteins belonging to the SCOP class and longer proteins are more likely to fold in a sequential fashion (Deane ; Saunders ). We used our SAINT2 modes to assess the relationship between length/SCOP class and the improvements observed by performing predictions sequentially. SAINT2 Forward presents a higher TM-Score Best for 9 out of 10 proteins in our soluble validation set (as shown in Fig. 3). Furthermore, the effect seems to be stronger (i.e. the differences between the TM-Score Best of SAINT2 Forward and Non-sequential are larger) for proteins. We observe no relationship between model improvement and protein length.

We have also compared the run time of our different modes of SAINT2. The time complexity of our fragment assembly approach is quadratic in the number of atoms. Given that SAINT2 Forward performs many moves on a reduced number of atoms (a portion of the full protein chain), it can generate decoys at least 1.5 times faster than SAINT2 Non-sequential (Supplementary Table S3). Overall, our results show that SAINT2 Forward employs a more efficient search approach and produces better models for a majority of modeling cases. 
There are no cases where a conventional, non-sequential search strategy is capable of producing a model that is significantly better than the ones generated sequentially. The improvement in modeling results was observed across all protein lengths and across all SCOP classes represented in our validation set.
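
The run-time argument in this section (quadratic cost per move, with sequential moves acting on a partial chain) can be checked with a back-of-the-envelope calculation. The sketch below is an idealization, not a model of SAINT2's actual timing: it assumes the cost of a move is exactly proportional to the square of the current chain length (the number of modeled atoms being proportional to the number of residues), that the same number of moves is performed at every chain length, and that all overheads are ignored. The function name `relative_cost` is hypothetical.

```python
def relative_cost(n_residues, start_len=1):
    """Idealized speed-up of sequential over non-sequential sampling.

    Non-sequential: every move costs n_residues ** 2.
    Sequential: moves are spread evenly over chain lengths
    start_len..n_residues (an assumption), each costing length ** 2.
    The same total number of moves is used in both cases.
    """
    lengths = range(start_len, n_residues + 1)
    mean_sequential_cost = sum(k * k for k in lengths) / len(lengths)
    return (n_residues ** 2) / mean_sequential_cost


if __name__ == "__main__":
    for n in (50, 100, 300):
        print(n, round(relative_cost(n), 2))
```

Under these assumptions the ratio approaches 3 for long chains, because the mean of k² over k = 1..N is close to N²/3. The reported speed-up of 1.5 to 2.5 lies below this idealized ceiling, as would be expected once fixed per-move costs and the actual distribution of moves are taken into account.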

3.3 Impact of sequentiality on the quality of transmembrane models

We also used SAINT2 to test whether a sequential approach is a more efficient way to sample the conformational space and generate accurate decoys than a standard, non-sequential approach for transmembrane α-helical bundles. These are known to be inserted cotranslationally into the membrane, so a fragment assembly protocol that imitates this process may succeed by following the natural folding pathway. We generated 10 000 decoys for each of the 24 targets in our transmembrane set using SAINT2 Forward and SAINT2 Non-sequential. For this comparison, identical fragment libraries, residue–residue contacts and number of moves to generate a decoy were used. We compared the best TM-score of all decoys (TM-score Best) produced by each of SAINT2 Forward and SAINT2 Non-sequential (Fig. 3). For 18 out of 24 proteins, SAINT2 Forward produces a more accurate decoy. In two cases, the improvement in TM-score Best for sequential over non-sequential is > 0.15. There are five cases where SAINT2 Forward produces a correct answer (TM-score > 0.5) and SAINT2 Non-sequential does not. Two of these correspond to the longest proteins in the set (292 and 324 residues) for which a correct answer was produced. There were just two cases where a correct answer was only produced non-sequentially.

3.4 CASP12 results

We have also compared our sequential approach to state-of-the-art prediction software. We used SAINT2 Forward to produce 10 000 decoys for the 23 free-modeling domains from CASP12 for which structural data were available. We compared the results obtained by SAINT2 Forward against the most successful predictor in CASP12, the Baker Group (Supplementary Fig. S10). As our models are postdiction rather than prediction, this comparison should only be used to see if SAINT2 Forward can produce results of similar quality. We compared the best model submitted by the Baker Group to CASP12 (best-of-five) to the five highest scoring models produced by SAINT2 Forward (best-of-five). Given that we currently do not have a model selection protocol for SAINT2, we have used the scoring function developed for model generation to select these (Supplementary Fig. S10A). We have also provided results for the best model produced by SAINT2 Forward (Supplementary Fig. S10B). The Baker group submitted models with correct topology (TM-Score > 0.5) for 4 out of the 23 free-modeling targets. SAINT2 Forward produced a model with correct topology for eight targets but only three of these cases were amongst the five highest scoring models. Our results show that models produced by SAINT2 Forward are of comparable quality to the state of the art.

3.5 Investigating the effect of directionality during model generation

The ROSETTA ab initio membrane protocol uses an incremental but bi-directional method to build decoys (Yarov-Yarovoy et al., 2006). It is therefore using sequential sampling, but not in the direction of protein synthesis. We assessed how directionality may affect conformational sampling by comparing SAINT2 Forward to SAINT2 Reverse, which performs the same sequential protocol, but in the non-biological C- to N-terminal direction. We generated 10 000 decoys for each of the 41 proteins in our soluble set and for the 24 targets in our transmembrane set using SAINT2 Forward and SAINT2 Reverse (Supplementary Fig. S11). For both modes, identical fragment libraries, residue–residue contacts and number of moves to generate a decoy were used. We observed small differences between the TM-Score Best of models generated for the soluble set by SAINT2 Forward and SAINT2 Reverse. For the transmembrane set, little difference was observed.

4 Discussion

In this paper, we have investigated the behavior of a sequential search heuristic for fragment-based de novo structure prediction. Our aim was to test whether a pseudo-greedy search strategy could reduce the computational cost of accurate de novo modeling. Our initial study assessed how efficiently our sequential predictor SAINT2 can produce a model of similar quality to its best possible model. There is a general perception in the literature that hundreds of thousands of decoys are required for a correct model to be produced (Kim ; Simoncini and Zhang, 2013; Xu and Zhang, 2012), whereas little evidence is presented as to how many decoys are required to produce a sufficiently good answer. SAINT2 Forward, in the majority of cases where a correct answer is produced, is able to produce it when less than 10 000 decoys are generated. This suggests that SAINT2 is either more efficient at sampling the conformational space or that other methods are generating an excessive number of decoys. It is possible that the number of decoys could be further reduced by optimization of the sequential search in terms of satisfying the distance constraints. As the number of available homologue sequences increases, so does the precision of contact prediction (de Oliveira ), which may enable greedier strategies. Traditional structure predictors always perform moves on and score a full protein conformation. SAINT2 Forward optimizes performance by performing moves on and scoring a peptide that is shorter than the target protein. In the analyses in this manuscript, where we use the same number of moves for the two methods, SAINT2 Forward is capable of generating an individual decoy between 1.5 and 2.5 times faster than SAINT2 Non-sequential. This means that regardless of the number of decoys produced, a sequential approach can significantly reduce the computational cost of accurate structure modeling. 
One possible issue with a sequential protein structure predictor is local entrapment: because the N-terminal residues are folded before the rest of the protein, they could become trapped in a local minimum that is not relevant for the global fold. This type of entrapment does not appear to influence our methodology as, on our soluble and transmembrane validation sets, SAINT2 Forward generates more correct models and better models than SAINT2 Non-sequential. Considering a threshold of 0.02 TM-score units to establish model similarity (Kryshtafovych ; Li ), there are no cases across all soluble and membrane cases where SAINT2 Non-sequential predictions are significantly better than SAINT2 Forward.
It is arguable that the improvement in modeling results described in this work could be a consequence of the particular implementation used in SAINT2. However, the same implementation was used in all modes of SAINT2 (the same potentials, scoring functions, heuristics and parameters). Furthermore, we have used potentials that were developed, trained and used in non-sequential protocols. Sequentiality had also been previously explored in transmembrane protein structure prediction by ROSETTA (Yarov-Yarovoy et al., 2006). Therefore, it seems that the improvement observed for sequential predictions is unlikely to be a consequence of our implementation.
We carried out a comparison of SAINT2 Forward’s performance on the CASP12 targets in order to establish whether its results are equivalent to those of the best performers within CASP. However, as our models are postdictions (though every care was taken to remove information available post CASP12; see Methods), we see these results as indicative rather than definitive. Models with a TM-score above 0.5 may not be useful for a large number of biological applications. Nonetheless, results from the most recent iteration of CASP show that, in the absence of a reliable template, protein structure predictors rarely achieve TM-Scores greater than 0.8. The most successful template-free predictor (Baker Group) produced a model with a TM-Score greater than 0.8 for only one free-modeling domain in our CASP12 dataset. SAINT2 was able to produce models with a TM-Score greater than 0.8 for approximately 10% of its soluble targets (4 out of 41). However, no model of this quality was produced for the targets in our transmembrane and CASP12 sets.
Currently, our structure prediction pipeline does not have a model selection protocol implemented. Therefore, for our CASP12 comparison, we considered both the best-of-five (as selected by SAINT2 score, see Results) and the best model out of all 10 000 decoys generated by SAINT2 Forward. Our score has not been optimized for ranking and this approach is unlikely to outperform any clustering selection protocol (Kryshtafovych ). To make a fairer comparison, it would be ideal to replicate the Baker Group’s decoy selection protocol, but unfortunately their method is not reproducible because it involves human intervention. When considering the best-of-five models output by SAINT2, we were able to predict the correct topology for three cases, one fewer than the Baker Group. When considering the best model output by SAINT2, the correct topology was predicted for eight of these cases. However, we do not know how many cases had at least one model with correct topology across all the decoys produced by the Baker Group, as these data are not publicly available. Even if we were to consider the best model produced by the Baker Group, the number of decoys produced by their protocol is on the order of hundreds of thousands, which far exceeds the 10 000 decoys produced by SAINT2. It may be that a comparison using identical computational time for SAINT2 as that used by the Baker Group in CASP12 would be most appropriate.
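The distinction between the best-of-five model and the best model among all decoys can be sketched as follows. This is a minimal illustration, not the SAINT2 pipeline: the `Decoy` record, the function names and the example values are hypothetical, and the code assumes that a lower predictor score indicates a better decoy, which this section does not state. The 0.5 TM-score cut-off for a correct model is the one used in the paper.

```python
from typing import NamedTuple


class Decoy(NamedTuple):
    name: str
    score: float     # predictor's own score (assumed: lower is better)
    tm_score: float  # TM-score against the native structure


def best_of_n(decoys, n=5, lower_is_better=True):
    """Best TM-score among the n decoys ranked highest by predictor score."""
    ranked = sorted(decoys, key=lambda d: d.score, reverse=not lower_is_better)
    return max(ranked[:n], key=lambda d: d.tm_score)


def best_overall(decoys):
    """Best TM-score among all decoys (requires the native structure)."""
    return max(decoys, key=lambda d: d.tm_score)


def is_correct(decoy, threshold=0.5):
    """A model is counted as correct when its TM-score exceeds the threshold."""
    return decoy.tm_score > threshold


# Made-up example values, for illustration only.
decoys = [
    Decoy("d1", 1.0, 0.30),
    Decoy("d2", 2.0, 0.45),
    Decoy("d3", 3.0, 0.40),
    Decoy("d4", 6.0, 0.62),
]
selected = best_of_n(decoys, n=2)
oracle = best_overall(decoys)
```

In this made-up example the score-based selection misses the only correct decoy, which mirrors why the best-of-five count can be lower than the count based on the best model among all decoys.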
The results in Supplementary Figure S10 establish that SAINT2 Forward is capable of producing models of comparable quality to those produced by the state of the art. Our findings highlight the importance of re-evaluating search strategies with the advent of increasingly accurate scoring functions.
The way in which predicted contacts are introduced during sampling has an impact on which conformations are sampled. For instance, it has been suggested that using only short-range contacts during the earlier stages of sampling can lead to modeling improvements for some proteins (Kosciolek and Jones, 2014). Due to the sequential nature of our algorithm, N-terminal contacts are introduced earlier and more moves can be dedicated to satisfying these constraints than for the C-terminal contacts. Our approach paves the way for considering different ways in which predicted contacts can be incorporated into structure prediction protocols.
Existing structure prediction software can, at times, produce correct models without the use of predicted contacts. We assessed the role of these contacts in the quality of modeling as performed by SAINT2 Forward by testing the protocol without predicted contacts (Supplementary Fig. S12). We find that correct models were produced by SAINT2 Forward without contacts for only 10 of the 18 cases where a correct model was produced by SAINT2 Forward with contacts. These results highlight the importance of accurate contact prediction for successful modeling (de Oliveira ).
We observed comparable results when predictions were generated in the biological direction and in its reverse. This is consistent with the notion that protein folding is a series of small optimization problems where segments of the chain fold independently (foldons) and then collapse to the complete structure (Hu ; Maity ).
Given the amount of experimental evidence to support the notion that proteins are folding as they are being translated (Basharov, 2000; Fedorov and Baldwin, 1997; Giglione ; Holtkamp ; Kolb, 2001; Puglisi, 2015), we have opted to maintain the biological direction as the standard approach in SAINT2. We have demonstrated the validity and applicability of a sequential, pseudo-greedy search heuristic to perform de novo model generation. When drawing an unbiased comparison, sequential prediction requires fewer decoys to produce good answers, can generate individual decoys faster and improves the overall modeling results.

56 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Scoring function for automated assessment of protein structure template quality.

Authors: Yang Zhang; Jeffrey Skolnick
Journal: Proteins Date: 2004-12-01

3. Cotranslational protein folding on the ribosome monitored in real time.

Authors: Wolf Holtkamp; Goran Kokic; Marcus Jäger; Joerg Mittelstaet; Anton A Komar; Marina V Rodnina
Journal: Science Date: 2015-11-27 Impact factor: 47.728

4. Multipass membrane protein structure prediction using Rosetta.

Authors: Vladimir Yarov-Yarovoy; Jack Schonbrun; David Baker
Journal: Proteins Date: 2006-03-01

5. Cotranslational protein folding (review).

Authors: A N Fedorov; T O Baldwin
Journal: J Biol Chem Date: 1997-12-26 Impact factor: 5.157

6. The Pfam protein families database.

Authors: Marco Punta; Penny C Coggill; Ruth Y Eberhardt; Jaina Mistry; John Tate; Chris Boursnell; Ningze Pang; Kristoffer Forslund; Goran Ceric; Jody Clements; Andreas Heger; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman; Robert D Finn
Journal: Nucleic Acids Res Date: 2011-11-29 Impact factor: 16.971

7. RBO Aleph: leveraging novel information sources for protein structure prediction.

Authors: Mahmoud Mabrouk; Ines Putz; Tim Werner; Michael Schneider; Moritz Neeb; Philipp Bartels; Oliver Brock
Journal: Nucleic Acids Res Date: 2015-04-20 Impact factor: 16.971

8. I-TASSER server: new development for protein structure and function predictions.

Authors: Jianyi Yang; Yang Zhang
Journal: Nucleic Acids Res Date: 2015-04-16 Impact factor: 16.971

9. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins.

Authors: David T Jones; Tanya Singh; Tomasz Kosciolek; Stuart Tetchner
Journal: Bioinformatics Date: 2014-11-26 Impact factor: 6.937