
A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy.

Sheng-Hung Juan1, Teng-Ruei Chen1, Wei-Cheng Lo1,2,3.   

Abstract

Protein secondary structure prediction is a classic topic in computational structural biology with a variety of applications. Over the past decade, state-of-the-art algorithms have achieved prediction accuracies above 80%; meanwhile, the time cost of prediction has increased rapidly because of the exponential growth of the fundamental protein sequence data. Based on literature studies and preliminary observations of how the size and homology of the fundamental protein dataset relate to the speed and accuracy of predictions, we raised two hypotheses that might help identify the main factors influencing the efficiency of secondary structure prediction. Experiments that reduced the size and homology of the fundamental protein dataset supported those hypotheses. They revealed that shrinking the dataset could substantially cut the time cost of prediction at the price of a slight decrease in accuracy, whereas homology reduction of the dataset could, on the contrary, increase accuracy. Moreover, the Shannon information entropy could be applied to explain how accuracy was influenced by the size and homology of the dataset. Based on these findings, we proposed that a proper combination of size and homology reductions of the protein dataset could speed up secondary structure prediction while preserving the high accuracy of state-of-the-art algorithms. When the proposed strategy was tested with the fundamental protein dataset of the year 2018 provided by the Universal Protein Resource, the speed of prediction was enhanced more than 20-fold while all accuracy measures remained equivalently high. These findings should help improve the efficiency of research and applications that depend on protein secondary structure prediction. To make future implementations of the proposed strategy easy, we have established a database of size- and homology-reduced protein datasets at http://10.life.nctu.edu.tw/UniRefNR.

Year:  2020        PMID: 32603341      PMCID: PMC7326220          DOI: 10.1371/journal.pone.0235153

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The secondary structure prediction (SSP) of a protein means to predict its per-residue backbone conformation merely based on the amino acid sequence. This technique has many applications, but there is an increasingly serious problem in its implementation, that is, the rapidly growing time cost. We believe that, if the speed of current SSP methods can be substantially enhanced, all fields relying on this technique will get benefits. This work thus aims to design a general strategy to cut down the time cost of SSP while preserving the accuracy for SSP methods. Currently, performing an accurate SSP for just one protein may take nearly an hour (see Results), greatly hampering large scale post-genomics applications. With the proposed strategy, a prediction with equivalent accuracy can be achieved in minutes. Proteins are the basic functional units of biological systems. The function of a protein is dependent on its structure, which is determined by its amino acid sequence. Theoretically, as long as the structure of a protein is known, we will be able to identify or understand its biological functions. However, as genomes are sequenced much more rapidly than protein structures are solved, most proteins in biological databases nowadays have only sequence information but not structures. Therefore, in this post-genomics era, sequence-based protein structure prediction is of particular importance because by doing that, we may further be able to predict the function of a protein and then achieve many applications. The SSP is a subcategory of protein structure prediction and a key step toward the prediction of the tertiary structure. 
If the efficiency of SSP can be enhanced, many research and application fields will also be improved, such as multiple sequence alignment and homology detection for proteins [1-3], the identification of disease-causing genetic mutations or variations [4-6], and the prediction of enzyme target sites [7,8], binding sites [9], antibody epitopes [10-12], protein-protein interactions [13,14] and protein subcellular localizations [15,16]. SSP has been developed for more than 65 years [17], during which the predictive feature set has evolved several times as algorithms have advanced and the power of computers has increased. In the 1970s, amino acid propensities and residue physicochemical properties were used to perform SSP based on statistical approaches [18]. Soon after that, the concept of a “window” enabled statistical analyses of the interaction between features of adjacent residues and achieved ~60% prediction accuracy [19]. As machine learning techniques have been applied since the late 1980s, the scale of the feature set has also expanded. Take PHD [20], a neural network method, for instance: it used multiple sequence alignment profiles as features to accomplish a three-state (helices, strands, and coils) SSP accuracy of ~70%. Psipred, another machine learning SSP method, was the first to use the position-specific scoring matrix (PSSM) generated by PSI-BLAST [21] as the feature set and pushed the three-state accuracy to 76.5% in 1999 [22]. From then on, the PSSM became the major feature set of SSP. Most highly accurate methods nowadays use it, including RaptorX [23], SpineX [24], SSpro8 [25], Scorpion [26], Spider2 [27], and DeepCNF [28]. The three-state (Q3) accuracy of these methods is about 80–82%. SSpro8, Scorpion, and DeepCNF are able to perform the eight-state (Q8) prediction, with an accuracy of around 67–72%. 
A PSSM is generated by retrieving homologs of a query sequence from a target dataset and then, based on the alignment of these homologous sequences, computing the normalized substitution rates of the 20 amino acids and transforming these rates into logarithmic scores for each residue. This technique was first applied by Gribskov et al. to the identification of distantly related proteins [29]. With the refinements by 1) the Henikoffs, who used a weighting scheme to reduce the influence of sequence redundancy [30], 2) Tatusov et al., who used a pseudocount algorithm to overcome the insufficiency of homologs from small target datasets [31], and 3) Altschul et al., who integrated several protein similarity comparison algorithms into PSI-BLAST [21], the PSSM became increasingly robust in fields relying on protein sequence analyses, including SSP. A target dataset is essential for generating a PSSM. In the SSP field, since the big jump in Q3 accuracy made by Psipred, the target dataset utilized by most methods has been the UniRef90 maintained by UniProt (Universal Protein Resource) [22-28]. Because of the rapid developments in next-generation sequencing technologies and genomics, the size of the UniRef90 dataset has increased dramatically in recent years. By Feb. 2019, this dataset contained over 90.0 million proteins, around 1.7 times its size two years earlier in Feb. 2017 (52.6 million). Although the UniRef90 is truly a great target dataset for SSP, as we keep using this rapidly growing dataset, the disk storage, memory space and computation time required to perform SSP will all increase dramatically as well. Therefore, we would like to examine whether it is necessary to use such a huge dataset, or whether there is a feasible strategy for shrinking the target dataset while preserving the accuracy. 
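The PSSM construction described above can be illustrated with a minimal sketch. It is not the PSI-BLAST procedure: it assumes an already-aligned set of homologs, a uniform background frequency of 1/20, a fixed additive pseudocount, and no sequence weighting. The function name `pssm_from_alignment` and its parameters are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_from_alignment(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (not from the source): uniform background frequencies,
    a simple additive pseudocount, and no sequence weighting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Count only the 20 standard residues; gaps and others are ignored.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total  # normalized rate
            column[aa] = math.log2(freq / background)  # log-odds score
        pssm.append(column)
    return pssm

toy_alignment = ["ACD", "ACE", "AGD"]
toy_pssm = pssm_from_alignment(toy_alignment)
print(round(toy_pssm[0]["A"], 3), round(toy_pssm[0]["W"], 3))
```

In this toy alignment the fully conserved first column gives a positive score for A and negative scores for every other residue, which is the behavior a log-odds profile is meant to capture.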
We noticed that the size of UniRef90 had increased exponentially in 10 years, but the Q3 accuracy of state-of-the-art SSP methods had stayed at a plateau of 80–82% [17]. Besides, we discovered that the sequence homology of the target dataset would greatly influence the number of zero entries in the probability matrix of a PSSM (see S1 File). Based on these observations, we hypothesized that 1) the size of target dataset would have a greater effect on the time cost than on the accuracy of SSP, and 2) the sequence homology level of the target dataset might affect the quality of the generated PSSM, which would then influence the accuracy of SSP. By random sampling, the target dataset can be shrunk. We created many shrunk versions of the UniRef90 of 2015. Performance evaluations of seven state-of-the-art SSP methods supported our hypothesis. Even when UniRef90 was reduced to 1/16 in size and time cost was dramatically cut down by 93.1%, the average accuracy of those methods only decreased by 1.2%. By reducing the sequence redundancy of the target dataset, the dataset can also be shrunk, and, based on our hypothesis, the accuracy might be changed. Very interestingly, homology reduction experiments revealed that the accuracy was not only changed but improved. When examining how the accuracy was improved, we discovered that the Shannon information entropy might be capable of measuring the quality of a PSSM and explaining the accuracy it produces. Although the accuracy may be decreased because of target dataset size shrinkage, it can be improved by homology reduction. We thus proposed a balanced strategy that a homology reduction of the target dataset to 25% sequence identity accompanied with a size shrinkage to 5 million proteins by random sampling will greatly cut down the time cost while preserving the accuracy of an SSP method. Finally, we evaluated this strategy with state-of-the-art SSP methods using the UniRef90 of 2018. 
The high accuracy was maintained while the speed was enhanced 20.9-fold. We hope that this strategy can be widely adopted by current protein secondary structure prediction systems; the resulting performance improvement should benefit research and applications in which SSP plays a key role, especially in this post-genomics era. To sum up, this study aims to find the main factors that influence the speed of SSP and to design a strategy to accelerate it. The experimental results not only supported our hypotheses but also helped us discover a shortcut to a remarkable enhancement of speed while preserving high accuracy.

Results

Effects of target dataset size on the performance of SSP

We hypothesized that the influence of the target dataset size on the time cost of SSP would be much more significant than on accuracy. Based on this hypothesis, it would be expected that if the target dataset were shrunk, the reduction of computation time would be much more apparent than the decrease in accuracy. In this experiment, we shrank the UniRef90-2015 dataset by random sampling and created subsets with 1/2^k of its size, where k ranged from 0 to 20. The actual sizes of these subsets decreased from 38.2 million, 19.1 million …, to 37 proteins. Seven state-of-the-art SSP algorithms were tested with the TS115 dataset (see [17] and Materials and Methods) as the query set. This experiment was repeated ten times to obtain the average and standard deviation of the performance measures. As demonstrated in Fig 1, all methods showed the same trends. The time cost fell rapidly as the target dataset was shrunk, and the difference in time cost between the largest and smallest target datasets could reach 3355-fold. In contrast, except for SSpro8 [25], the decrease in accuracy was negligible for large target datasets, and the maximum difference in accuracy, regardless of the three- or eight-state measures, was less than 14.1%.
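The size-reduction series can be sketched as follows. This is only an illustration under stated assumptions: the identifiers are held in memory, each subset is drawn independently from the full set (the text does not say whether the subsets were nested), and sizes are rounded to the nearest integer. The function name `halving_subsets` is hypothetical.

```python
import random

def halving_subsets(ids, max_k, seed=0):
    """Draw one random subset of size ~len(ids)/2**k for each k in 0..max_k.

    Assumption: every subset is sampled independently from the full list.
    """
    rng = random.Random(seed)  # fixed seed so a repeat is reproducible
    subsets = {}
    for k in range(max_k + 1):
        size = max(1, round(len(ids) / 2 ** k))
        subsets[k] = rng.sample(ids, size)  # sampling without replacement
    return subsets

toy_ids = list(range(1024))
toy_subsets = halving_subsets(toy_ids, max_k=10)
print([len(toy_subsets[k]) for k in range(11)])
```

Repeating the call with different seeds would give the kind of replicate samples from which averages and standard deviations can be computed.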
Fig 1

The time cost and accuracy of state-of-the-art secondary structure prediction methods assessed with PSSM target datasets of decreasing size.

(A) Scorpion. (B) SpineX. (C) Spider2. (D) Psipred. (E) DeepCNF. (F) RaptorX. (G) SSpro8, template-free mode. (H) SSpro8 with refined PSI-BLAST settings, template-free mode. The query dataset used in this experiment was the TS115 prepared by [17], and the PSSM target sequences were all sampled from UniRef90-2015. The horizontal axis is the extent of target dataset size reduction, where 1 represents the original size of the UniRef90-2015. The vertical axes respectively indicate the time cost (red) and accuracy (blue). The computed accuracy measures include Q3, Q8, SOV3, and SOV8. For clarity, the color codes are displayed only in (A) and (E). All these state-of-the-art SSP methods exhibited the same trend as the target dataset size was reduced. The decrease was much quicker in time cost than in accuracy. To speed up prediction, SSpro8 adopted a very low E-value setting for the PSSM generator PSI-BLAST; however, we found that with this setting the accuracy was substantially sacrificed (G). After the setting was modified to be the same as that of the other methods (see S1 Table), the accuracy of SSpro8 was largely preserved (H). Using the full-sized UniRef90-2015, for any 3-state predictor, the average time cost for one query protein was ~12 minutes. As for the 8-state predictors, the time costs varied considerably. DeepCNF was the most accurate algorithm; it took 47 minutes to predict one protein. The experiment was repeated ten times, and standard deviations are plotted on the curves, although the deviations of time cost may be too small to see (refer to S2 Table for the raw data).

All the assessed methods utilized PSI-BLAST as the sequence alignment search and PSSM generation engine. A finer examination of the SSpro8 method let us recognize that the E-value threshold settings of PSI-BLAST in the SSpro8 pipeline script were very different from other methods. 
The E-value thresholds of the PSSM generation and alignment search stages set by SSpro8 were 10^-10 and 0.001, respectively, whereas the other six methods set these thresholds as 10^-3 and 10, respectively. Lower E-value thresholds would make PSI-BLAST discard more statistically insignificant homologs and speed up the generation of PSSM and dataset search; however, they would also sacrifice the quality of PSSM and the precision of search results [32,33], which might lead to the decrease in SSP accuracy. Indeed, after we modified the E-value settings of SSpro8 to be the same as those of the other methods, despite the increase in time cost, all accuracy measures were greatly improved to a level equivalent to the others (compare Fig 1G and 1H). This refined setting was thus applied to SSpro8 throughout this study. See S1 Table for the settings of PSI-BLAST for all assessed SSP methods. The results of these methods were averaged and summarized in Fig 2 and S2 Table. The time cost decreased linearly as the target dataset size reduced (note that the horizontal axis is logarithmic). On average, the decrease in accuracy was minimal as long as the target dataset was larger than 1/8 of the UniRef90-2015. The decrease became obvious when the target dataset size was between 1/16 and 1/2^14 of the UniRef90-2015; for target datasets smaller than this range, the decrease in accuracy slowed down. These results supported our hypothesis, and we found that the curves of time cost (TC) and accuracy of these size reductions could be fitted by the following equations, where n stands for the size (number of proteins) of the target dataset. As illustrated in Fig 2B, these equations agreed well with the experimental results. 
It is noteworthy that, although the trends demonstrated by these equations might be tangible, the constants associated with them only fitted the TS115 query set, a PSSM target dataset with <90% identities (e.g., the UniRef90), and our hardware/software setup (3.33 GHz CPU + 166 GB RAM + single PSI-BLAST thread; see S1 Table and Materials and Methods).
Fig 2

The average time cost and accuracy of the assessed secondary structure prediction methods as the target dataset size reduces.

(A) The curves of actual experimental results averaged over seven SSP methods. The meanings of the axes of this plot are the same as those of Fig 1, but the scale of the vertical axis of accuracy (blue) is adjusted to start from 0.5 to make visible the standard deviations, which were obtained by ten repeats of random sampling test for each assessed method. Supporting our hypothesis, the decrease in time cost was much faster than that in accuracy. Provided that the target dataset sampled from the UniRef90-2015 was larger than ~5 million sequences (indicated by the dotted vertical line), the decrease in accuracy was minor. (B) The curves of fitted Q3 and Q8 accuracies. The horizontal axis indicates the number of proteins in the PSSM target dataset (M: million, K: thousand). The points of accuracies are drawn by the actual data, but the curves are made by Eqs (2) and (3), showing that those equations fit well with the experimental results.

To examine whether the trends of these equations were robust when independent query sets were applied, we performed the same experiment on the CASP12 and CASP13 query sets. Just like the TS115, these datasets were composed of novel protein structures deposited in the PDB (Protein Data Bank) [34] after the assessed SSP methods were published. As summarized in S1 Fig and S2 Table, the trends of time cost and accuracy obtained based on the results of CASP12 and CASP13 agreed well with Eqs (1)–(3). The relationship between the time cost and the target dataset size was linear, while that between the accuracy and the target dataset size was sigmoid. For example, the equations of the time cost were, where they were both in the form of a classic linear function (y = mx + b). The slope (or, the m) of CASP13 was larger than TS115 and CASP12 because CASP13 contained more large proteins than the latter two. 
Because TS115 contains many more proteins than the CASP datasets, in this report, we use the results of TS115 as the mainstream to explain our discoveries and make those of the CASPs available in the Supporting information files. The conclusions of this study are all applicable to the three query sets.

Effects of target dataset sequence redundancy on the performance of SSP

In addition to sampling, another way to shrink a target set is homology reduction, that is, making sequence identity non-redundant (NR) subsets. In general, a homology-reduced subset of lower identity contains fewer sequences. Since the target dataset size decreases as the sequence redundancy lowers, based on our first hypothesis and the results of the first experiment, it can be sure that the time cost of SSP will be cut down. As for whether the accuracy will decrease as the redundancy lowers, the situation is more complicated. The second hypothesis of this study says that the homology level of the target dataset would affect the quality of PSSM and thus the accuracy of SSP. Accordingly, it would be expected that as the sequence redundancy decreases, the accuracy might diverge from the curves of Eqs (2) and (3), which consider only the dataset size. The question is: Will the accuracy be higher or lower than the curves? Since we had observed that a PSSM of low complexity might produce decreased accuracy and that the raw probability values of a PSSM generated from homologous sequences of high redundancy are usually less complicated than those generated from non-redundant sequences (S1 File), we supposed that homology reduction of the target dataset might improve SSP accuracy. Using UniRef90-2015 as the source of target sequences, we created a series of homology reduced target datasets to perform this experiment (Materials and Methods). The sequence identity of these non-redundant datasets ranged from 80% to 25%. Fig 3 shows the average time cost and accuracy of the seven state-of-the-art SSP algorithms tested with TS115 as the query dataset (see S2 Fig and S3 Table for the plots and raw data of individual algorithms). In this figure, the curves made based on Eqs (1)–(3) were provided too. Agreed again with our first hypothesis, even when the target dataset was shrunk because of homology reduction, the time cost dropped. 
More interestingly, the Q3 and Q8 both went higher than the curves of Eqs (2) and (3), not only supporting our second hypothesis but implying that a low homology helps improve the accuracy. The CASP12 and CASP13 query datasets were also applied to perform this experiment, and the conclusion obtained from them was consistent with the above (refer to S3 Fig for the average performance and S3 Table for the raw data).
Fig 3

The average time cost and accuracy of the assessed secondary structure prediction methods as the sequence homology of the target dataset reduces.

The horizontal axis is the extent of homology reduction in sequence identity non-redundancy. For example, NR90 means that any two sequences in the PSSM target dataset share <90% identities. The lowest sequence identity was 25% because when a lower identity such as 20% was applied, the number of sequences remaining in the target dataset would be far smaller than the 5 million safety zone (see Fig 2 and S6 Table) and then cause unreasonable decreases of the accuracy. The right vertical axis is the time cost (red), and the left is the accuracy (blue). The dotted red curve plots the expected time cost according to the Eq (1) formulated in the experiment of target size reduction. Because as the homology of the target dataset was reduced the dataset was also significantly shrunk, the time cost of SSP methods decreases rapidly. The dotted light and dark blue curves were the expected values of Q3 and Q8 made according to the size-reduction accuracy Eqs (2) and (3), respectively, while the solid blue curves showed the actual values. Very interestingly, as the target dataset homology reduced, the accuracy increased slightly.

To understand how the homology of the target dataset influenced SSP accuracy, we tried to quantify the complexity of the PSSM by applying the Shannon information entropy [35] widely used to measure the disorder of data of a variable. A higher information entropy computed from a PSSM stands for a higher disorder of the probability matrix, which we suppose to reflect a higher complexity of the PSSM. As shown in Fig 4, where the entropy and Q3 accuracy of target datasets with decreasing homology are drawn in one plot, the two variables seemed to have some correlation. The Pearson’s correlation coefficient between them was 0.417, a positive relationship. Similarly, the correlations between them obtained from the CASP12 and CASP13 tests were positive (S4 Fig). 
However, because as we reduced the homology the target dataset size was passively shrunk as well (meaning that there were two changing factors in this experiment), the actual correlation between entropy and accuracy should be carefully reexamined by fixing one of the two factors.
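A minimal sketch of an entropy measure of this kind is given below. The exact definition used in the study is not given in this section, so the sketch assumes the standard Shannon entropy, H = -Σ p·log2(p), computed on each row of the PSSM probability matrix (one row per residue position, 20 probabilities per row) and averaged over positions. The function names are hypothetical.

```python
import math

def column_entropy(probs):
    """Shannon entropy (in bits) of one probability row; zero entries add 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mean_pssm_entropy(prob_matrix):
    """Average per-position entropy over a PSSM probability matrix.

    Assumption: averaging over positions is one plausible way to reduce
    the matrix to a single number; the source may aggregate differently.
    """
    return sum(column_entropy(row) for row in prob_matrix) / len(prob_matrix)

conserved = [1.0] + [0.0] * 19   # a single residue observed: no disorder
uniform = [0.05] * 20            # all residues equally likely: maximal disorder
print(column_entropy(conserved), round(column_entropy(uniform), 4))
print(round(mean_pssm_entropy([conserved, uniform]), 4))
```

A row with many zero entries, as observed for highly redundant homolog sets, contributes little entropy, which is consistent with the interpretation of low entropy as low PSSM complexity.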
Fig 4

The average information entropy and Q3 accuracy of the assessed secondary structure prediction methods as the sequence homology of the target dataset reduces.

The horizontal axis is the extent of target dataset homology reduction. The right vertical axis is the Shannon information entropy (green), and the left is the Q3 accuracy (blue). These two curves showed similar trends. The Pearson’s correlation coefficient was 0.417, a positive relationship.


Effects of sequence reduction of target dataset with a fixed dataset size on the performance of SSP

Here we selected a fixed dataset size to dissect further the influence of target sequence homology on the performance of SSP. Referring to Fig 2 and Eqs (2) and (3), we found that the decrease in accuracy was negligible as long as the target dataset was larger than 1/8 of the UniRef90-2015, i.e., ~5 million sequences. Thus, we decided to use this number of proteins as the fixed dataset size to repeat the previous experiment. Because this size was smaller than the NR datasets of the previous experiment, multiple repeats by random sampling could be applied. According to our hypotheses and the results of the above experiments, when the TS115 was applied as the query dataset, we expected that the time costs of these size-reduced NR datasets would all be similar while the accuracy would be higher than the curves of Eqs (2) and (3). The results summarized in Fig 5 demonstrate that, when the target dataset size was fixed at 5 million proteins, the average time cost of the assessed methods stayed steady with a slight decrease as the homology was reduced from 90% to 25% identity (see S5 Fig for the plots of particular algorithms). Importantly, the accuracy values were all higher than the values depicted according to Eqs (2) and (3). Although the average differences in Q3 and Q8 between the 90% and 25% NR datasets were both only 0.70%, the average p-values were <1.25×10^-6 and <9.99×10^-8, respectively (see S4 Table for the raw data and p-value of each method), indicating that these improvements in accuracy were statistically significant. Moreover, this improvement compensated for the decrease in accuracy caused by the shrinkage of the target dataset. Comparing S2 and S4 Tables, the Q3 achieved by the 5-million-protein target dataset of 25% identity was 0.1% higher than the Q3 achieved by the full UniRef90-2015 on average. As for the average Q8, the value accomplished by the former dataset was only 0.1% lower than that accomplished by the latter. 
The CASP query sets were also tested. On average, the Q3 and Q8 of them obtained with the UniRef25-2015 of 5 million proteins were merely 0.3‰ and 0.4% lower than those obtained with the full UniRef90-2015, respectively (S6 Fig and S4 Table). According to these data and those shown in Fig 3, we concluded that homology reduction of the target dataset could improve SSP accuracy. Even if the accuracy might be lowered by size reduction of the target dataset, the homology reduction could counteract the effects and restore/improve the accuracy.
Fig 5

The average time cost and accuracy of the assessed secondary structure prediction methods, as the sequence homology of the target dataset reduces with a fixed dataset size.

(A) The results of time cost, Q3, and Q8. (B) The results of Q3 and Q8, with the axis of accuracy zoomed in to make visible the standard deviations. In both plots, the dotted lines indicate the expected time cost or accuracy, according to the Eqs (1)–(3) formulated by target dataset size reduction. The size of these target datasets was fixed as 5 million proteins. The actual time costs (the solid red curve) by the datasets of low homology were slightly lower than the predicted, indicating that homology reduction of the target dataset can slightly improve the speed of SSP. More interestingly, the accuracies achieved by low homology sets were not only higher than the values predicted based on Eqs (2) and (3) but also significantly higher than the accuracies achieved by high homology sets. These data were averaged over seven methods. All p-values of these methods for the difference in Q3 or Q8 between the NR25 and NR90 datasets were <10^-5. To conclude, homology reduction of the PSSM target dataset helps improve SSP accuracy.

We also computed the information entropy of the PSSM generated from these size-fixed NR target datasets (Fig 6). Compared with Fig 4, the trend of entropy got much more similar to Q3 as the homology of the target set decreased. The Pearson’s correlation coefficient between them was 0.983, a strong relationship. The correlation coefficients between entropy and Q8, SOV3, and SOV8 were 0.947, 0.970, and 0.917, respectively. Consistent with these TS115 results, the correlation coefficients between the entropy and Q3 obtained with the CASP12 and CASP13 were 0.966 and 0.948, respectively, and the coefficients between the entropy and other accuracy measures all indicated strong positive correlations (>0.863; see S7 Fig). These results suggested that the information entropy of a PSSM may help explain the SSP accuracy it produces.
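For readers who wish to reproduce this kind of comparison on their own data, the Pearson correlation coefficient between a series of entropy values and a series of accuracy values can be computed as in the sketch below. The numbers in the example are toy values chosen to show the two extreme cases; they are not data from this study, and the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy series only: a perfectly increasing and a perfectly decreasing pair.
print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfectly positive relationship
print(pearson_r([1, 2, 3], [3, 2, 1]))   # perfectly negative relationship
```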
Fig 6

The average information entropy and Q3 of the assessed secondary structure prediction methods, as the homology of the target dataset reduces with a fixed dataset size.

The horizontal axis is the extent of homology reduction. The color codes are the same as Fig 4. When the target dataset size was fixed at 5 million sequences, i.e., the boundary of the safety zone (Fig 2), the PSSM information entropy and SSP accuracy showed a strong correlation, implying that the entropy may be helpful to explain how the quality of a PSSM influences the accuracy of SSP.

The proposed strategy for the speed enhancement of SSP–with performance assessments

We discovered that, although shrinking the target dataset would decrease the accuracy, the decrease is minimal as long as the dataset is large enough. Meanwhile, given a fixed target dataset size, homology reduction would increase the accuracy, especially at low identity levels. Taken together, we proposed a strategy to accelerate SSP without sacrificing the accuracy, that is, using a UniRef dataset reduced in both size and homology as the PSSM target dataset instead of the full UniRef90. The recommended extent of reduction is 5 million proteins with <25% sequence identities. If for some studies the accuracy is much more critical than the speed, an alternative strategy is simply the homology reduction of the target dataset. According to the results demonstrated in Fig 3, it is expected that, if the size of the target dataset is much larger than 5 million proteins, a homology reduction at 70% sequence identity is sufficient to enhance the accuracy. All the above experiments were performed using target sequences released no later than 2015. To test whether the proposed strategy is promising for future applications, we assessed it with the UniRef90 of the year 2018 as the source of target sequences. We expected that 1) by using Eqs (1)–(3), the time cost and accuracy of the full UniRef90-2018 could be well predicted, 2) the time cost could be significantly cut down, and 3) the accuracy achieved by the size-homology-reduced target dataset, the UniRef25-2018, would be close to that achieved by the full UniRef90-2018. As shown in Table 1, the equation-predicted time cost and accuracy for the full UniRef90-2018 agreed well with the actual results. By reducing the size of the target dataset to 5 million proteins, the speed was increased 20.9-fold. Importantly, all the accuracy measures achieved by the size-homology-reduced UniRef25-2018 were close to those achieved by the enormous full UniRef90-2018.
As for the alternative strategy, when only a 70% identity homology reduction was applied such that the target dataset was passively shrunk to 46.2 million proteins (UniRef70-2018), in addition to a 2.1-fold speed improvement, the three-state and eight-state accuracies were enhanced by 0.5–0.8% and 0.1–0.2%, respectively. See S5 Table for detailed data and the results of particular SSP methods.
Table 1

Predicted and actual performances of state-of-the-art SSP methods assessed with the original and manipulated UniRef 2018 target datasets.

| Target dataset | Full UniRef90-2018 (87.3 million proteins), predicted (a) | Full UniRef90-2018 (87.3 million proteins), actual | Full UniRef70-2018 (46.2 million proteins) | Sampled UniRef25-2018 (5.0 million proteins) |
|---|---|---|---|---|
| Running time (sec) | 2507.575 | 2438.778 (b) | 1181.185 (c) | 116.938 ± 0.890 (b, c) |
| Q3 | 0.807 | 0.806 | 0.811 | 0.804 ± 0.002 |
| SOV3 | 0.759 | 0.756 | 0.764 | 0.755 ± 0.003 |
| Q8 | 0.688 | 0.690 | 0.691 | 0.683 ± 0.002 |
| SOV8 | 0.655 | 0.655 | 0.657 | 0.646 ± 0.003 |

(a) The query set of this experiment was the TS115. The running time, or time cost, is predicted by Eq (1). The Q3 and Q8 were predicted by Eqs (2) and (3), respectively. The predicted SOV values were computed by fitting the curves of SOV shown in Fig 2A.

(b) All these performance measures were averaged over seven SSP methods, inclusive of this time cost. On average, using the full UniRef90-2018, it took 41 minutes to predict one protein. Accurate methods required a longer time. For instance, the most accurate method DeepCNF took 93 minutes for one protein.

(c) If predicted by Eq (1), the time costs of the full UniRef70-2018 and the sampled 5-million-protein UniRef25-2018 datasets would be 1330 and 149 sec, respectively. The actual time costs of them were only ~90% and ~80% of the predicted values, respectively, demonstrating again that homology reduction of the target dataset also slightly improves the speed of SSP.
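The speed-up factors and the actual-to-predicted ratios quoted for Table 1 follow directly from the listed running times. The following is a minimal arithmetic check, not part of the original analysis; the variable names are illustrative assumptions, and the numbers are copied from Table 1 and its footnotes.

```python
# Running times in seconds, copied from Table 1 (variable names are illustrative).
full_uniref90 = 2438.778     # actual, full UniRef90-2018
full_uniref70 = 1181.185     # actual, full UniRef70-2018
sampled_uniref25 = 116.938   # actual, sampled 5-million-protein UniRef25-2018

# Time costs predicted by Eq (1), as quoted in the table footnote.
predicted_uniref70 = 1330.0
predicted_uniref25 = 149.0

# Speed-up relative to the full UniRef90-2018.
speedup_uniref25 = full_uniref90 / sampled_uniref25
speedup_uniref70 = full_uniref90 / full_uniref70

# Actual time cost as a fraction of the size-only prediction.
ratio_uniref70 = full_uniref70 / predicted_uniref70
ratio_uniref25 = sampled_uniref25 / predicted_uniref25

print(f"UniRef25-2018 speed-up: {speedup_uniref25:.1f}-fold")
print(f"UniRef70-2018 speed-up: {speedup_uniref70:.1f}-fold")
print(f"UniRef70-2018 actual/predicted: {ratio_uniref70:.2f}")
print(f"UniRef25-2018 actual/predicted: {ratio_uniref25:.2f}")
```

The two ratios below 1 are what the footnote interprets as a small additional speed gain from homology reduction beyond what dataset size alone would predict.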

To examine the feasibility of the proposed strategy thoroughly, we now tested it oppositely by using an extremely large PSSM target dataset with sequence homology higher than 90% identity to assess the performance of the state-of-the-art SSP methods. The non-redundant protein dataset maintained by the NCBI (National Center for Biotechnology Information) collected the most comprehensive protein sequence data [36] and was also utilized by some modern SSP methods as the default target dataset, such as the Psipred [22]. At the time of this article, the NCBI NR dataset contained 257.1 million sequences sharing <100% identities.
In order to predict the time cost and accuracy for this NrNCBI100-2020 dataset, the results of homology reduction experiments on the UniRef 2015 and 2018 obtained with the TS115 query set were combined (see S6 Table) to fit the curves shown below, where TC, Q3, and Q8 stand for, respectively, the time cost, Q3, and Q8 accuracy of the homology-reduced target dataset, n denotes the number of target sequences, and c means the sequence identity cutoff of the target dataset. These equations are very different from Eqs (1)–(3) because, in this experiment, the homology and size of the target dataset may both change. According to these equations, when the state-of-the-art SSP methods were performed with TS115 as the query set and NrNCBI100-2020 as the target dataset, the average time cost, Q3, and Q8 would be 5,571 sec, 0.799, and 0.682, respectively, where n = 257.1 million (proteins) and c = 100 (% identity). In other words, it was expected that the time cost of using NrNCBI100-2020 would be much higher than using UniRef90-2015/2018 because of its large size and that the accuracy obtained with NrNCBI100 would be lower than the UniRef90 datasets because of the increased sequence redundancy. After the real test, the time cost, Q3, and Q8 were 5,756 sec, 0.789, and 0.676, respectively (see S7 Table for the raw data of particular methods), very close to the expected values. These results supported that using a low homology target dataset with a suitably-shrunk size is the right direction for improving the efficiency of SSP.

Discussion

On the speed enhancement by size reduction of the target dataset

The results of all experiments in this study support our first hypothesis that the size of the target dataset exerts a much greater influence on the time cost of SSP than on its accuracy. Although the average time cost of the assessed SSP algorithms could be well fitted by Eq (1), some factors have not yet been considered. The pipeline of SSP methods nowadays typically comprises two stages: 1) the generation of the PSSM by a PSI-BLAST sequence similarity search against the target dataset, and 2) the prediction by machine learning with the PSSM as predictive features. The computation time is the sum of the two. A comprehensive analysis of the computation time of the assessed methods reveals that the former is the major part. As exemplified in Table 2 (see S1 Table for full data), the PSSM generation time of all methods is approximately linearly proportional to the size of the target dataset, meaning a time complexity of O(n), where n is the number of proteins in the dataset, in agreement with [32]. The time cost of prediction, however, remains almost constant regardless of the dataset size, except for DeepCNF [28]. The main reason why the prediction time of DeepCNF is much longer than that of other algorithms and rises as the target dataset enlarges is that it conducts another round of sequence similarity search against the target dataset in the prediction stage [28].
Table 2

The PSSM generation and secondary structure prediction time of the assessed SSP methods.

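Columns give the target dataset size (number of proteins; M: million).

| Method | Stage | 0.15M | 0.30M | 0.60M | 1.19M | 2.39M | 4.78M | 9.55M | 19.10M | 38.20M |
|---|---|---|---|---|---|---|---|---|---|---|
| Scorpion | PSSM | 2.27 | 4.56 | 9.22 | 18.52 | 38.11 | 79.30 | 164.01 | 344.55 | 719.39 |
| Scorpion | Pred | 22.76 | 22.83 | 22.85 | 22.72 | 22.75 | 22.81 | 22.79 | 22.83 | 22.90 |
| SpineX | PSSM | 2.28 | 4.55 | 9.11 | 18.39 | 37.94 | 79.41 | 164.00 | 344.74 | 720.54 |
| SpineX | Pred | 5.94 | 5.93 | 5.93 | 5.95 | 5.96 | 5.95 | 5.98 | 5.98 | 5.99 |
| Spider2 | PSSM | 2.84 | 5.22 | 9.87 | 19.17 | 38.75 | 80.35 | 164.58 | 424.24 | 914.54 |
| Spider2 | Pred | 1.34 | 1.17 | 1.06 | 0.95 | 0.93 | 0.91 | 0.88 | 0.93 | 1.03 |
| Psipred | PSSM | 2.25 | 4.54 | 9.11 | 18.46 | 38.14 | 78.99 | 163.26 | 343.04 | 721.06 |
| Psipred | Pred | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| DeepCNF | PSSM | 2.24 | 4.55 | 9.16 | 18.55 | 38.26 | 79.94 | 165.24 | 346.57 | 722.20 |
| DeepCNF | Pred | 16.84 | 29.45 | 54.59 | 110.76 | 207.12 | 382.16 | 625.98 | 1061.38 | 2114.31 |
| RaptorX | PSSM | 3.31 | 6.89 | 14.23 | 29.71 | 61.78 | 132.45 | 280.70 | 606.79 | 1271.95 |
| RaptorX | Pred | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.91 | 0.91 | 0.91 |
| SSpro8 | PSSM | 1.73 | 3.69 | 7.80 | 16.61 | 35.43 | 76.30 | 160.73 | 342.09 | 714.13 |
| SSpro8 | Pred | 4.38 | 4.67 | 4.95 | 5.19 | 5.37 | 5.55 | 5.69 | 5.86 | 5.92 |
| SSpro8+ | PSSM | 2.32 | 4.62 | 9.17 | 18.42 | 37.88 | 79.08 | 163.87 | 344.39 | 724.29 |
| SSpro8+ | Pred | 6.15 | 6.16 | 6.14 | 6.12 | 6.12 | 6.13 | 6.13 | 6.13 | 6.14 |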

The unit of these data is second. Each PSSM generation and Prediction time was measured and averaged with ten repeats. The query set applied here was the TS115. For the results of the CASP query sets, or the results of the target datasets with <0.15 million proteins, please see S2 Table.
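One simple way to see the roughly linear scaling of PSSM generation time is to note that each column of Table 2 approximately doubles the dataset size, so the time should approximately double as well. The sketch below checks this for the Psipred PSSM row; the function name `doubling_ratios` is an illustrative assumption, and the numbers are copied from Table 2.

```python
# Target dataset sizes (million proteins) and Psipred PSSM generation times (sec),
# copied from Table 2.
sizes_million = [0.15, 0.30, 0.60, 1.19, 2.39, 4.78, 9.55, 19.10, 38.20]
psipred_pssm_sec = [2.25, 4.54, 9.11, 18.46, 38.14, 78.99, 163.26, 343.04, 721.06]


def doubling_ratios(values):
    """Return the ratio of each value to the previous one."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


size_ratios = doubling_ratios(sizes_million)
time_ratios = doubling_ratios(psipred_pssm_sec)

for size, s_ratio, t_ratio in zip(sizes_million[1:], size_ratios, time_ratios):
    print(f"{size:6.2f}M  size x{s_ratio:.2f}  time x{t_ratio:.2f}")
```

The time ratios stay close to the size ratios across the whole range, which is consistent with the approximately linear behavior described in the text; they drift slightly above 2 for the largest datasets.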

Because the majority of SSP algorithms spend most of the computation time in generating the PSSM, any strategy capable of reducing the PSSM generation time will speed up the whole process. The SSpro8, for example, reduces its time cost by setting higher E-value thresholds for the PSI-BLAST in sequence similarity search. However, this adjustment does not save much time; for target sets larger than a million proteins, it only cuts down the time cost by 4.6% on average. Meanwhile, it dramatically sacrifices the accuracy (Fig 1 and S2 Table). We have proposed a simple strategy to enhance the speed of SSP without sacrificing accuracy: shrinking the target dataset by homology reduction and random sampling. We have shown that, although size reduction of the target dataset will decrease the accuracy, homology reduction will increase it. By choosing a suitable dataset size, like the recommended 5 million proteins, the effects of size reduction and homology reduction will offset each other. Besides, with the size of the target dataset fixed, the more massive the up-to-date UniRef becomes, the more running time will be saved. We have demonstrated that for the UniRef90-2015 (38.2 million) and UniRef90-2018 (87.3 million) datasets, our strategy speeds up SSP 9.3-fold and 20.9-fold, respectively. It is expected that, in 2020, when the UniRef90 grows to ~160 million proteins, using this strategy will enhance the SSP speed around 40-fold. At present, using an accurate algorithm to predict the secondary structure of one protein may take nearly an hour.
Considering that the time cost will grow in proportion to the size of UniRef90, the speed enhancement achieved by this strategy shall be highly valuable, especially for large-scale studies such as the functional assignment of hypothetical proteins, structural annotation of proteins determined by structural genomics projects, predicting or analyzing protein interactomes, and other post-genomics applications like our computer-aided protein engineering system (see Future works).

On the effect of homology reduction of target dataset on the performance of SSP

Reducing the homology of the target dataset will accelerate SSP; this is mainly because the dataset is shrunk consequently. As demonstrated in Figs 3 and 5, the homology reduction itself exerts improving but minor effects on the speed. As for the accuracy, the effects of target set homology reduction seemed minor, too; specifically, there was only a ≤0.7% increase in either Q3 or Q8. However, it would be too hasty to conclude that the homology of the target dataset has no or little effect on accuracy. After all, the accuracies obtained in these experiments were all higher than those predicted by the functions of target dataset size, Eqs (2) and (3). Especially in the experiment of Fig 5, where the only factor that might influence the accuracy was the sequence homology, it was evident that the difference in accuracy between the 90% and 25% homology datasets was statistically significant (p-values < 10⁻⁵). These small but significant improvements led us to infer that the influence of the homology of the target dataset on SSP accuracy might not be small but just difficult to detect when the dataset is large. The PSSM of a query sequence is generated based on the homologs identified by PSI-BLAST in its iterative similarity search against the target dataset. During the search, PSI-BLAST, by default, only considers the top 500 hits ranked by the E-value [21]. When the dataset is enormous, at any homology level we have tested, the number of homologs for most query proteins might always be far more than 500, making the produced PSSMs generally similar. The influence of sequence homology might thus be barely detectable. If smaller target datasets were applied, the effects of homology reduction on SSP accuracy might be more prominent, and we expected that such effects could be detected by measuring the information entropy of the PSSM. More studies shall be conducted to examine this inference.

On the relationship between the information entropy of PSSM and the accuracy of SSP

Based on the experiments of Figs 4 and 6, we found that the Shannon entropy of PSSM is positively correlated with SSP accuracy, especially when there was only one changing factor. In Fig 6, the target dataset size was fixed while the homology was decreasing, and the correlation coefficients between the entropy and all accuracies were ≥0.917. To further test whether the entropy can be applied to explain the accuracy accomplished by a PSSM, we now turn to the first experiment again. In that experiment, while the target dataset size was decreasing, the homology of datasets was fixed (90% identity)–meaning the target dataset size is the only changing factor. Since the size of the target dataset exerts a more significant influence on the accuracy than the homology does, the correlation obtained here is expected to be stronger than that of Fig 6. Indeed, as shown in Fig 7 (query: TS115) and S8 Fig (query: CASP12, CASP13), when the entropy is plotted with the accuracy, both curves similarly go down deep as the dataset shrinks. Regarding the TS115 query set, the correlation coefficient between the entropy and Q3 is 0.997, and the coefficients for Q8, SOV3, and SOV8 are all 0.996. As for the CASPs, the correlation coefficient between the entropy and any accuracy measure is ≥0.979. In general, the Shannon entropy is associated with the extent of concentration of the probability distribution of a variable. If the distribution is concentrated on just a few values, the entropy is low. High entropy is obtained when the distribution is dispersed. Intuitively, a PSSM of low entropy looks simple because the observed probabilities for most amino acids are zero. In contrast, a PSSM of high entropy looks complicated (see S1 File). 
Hence, our data imply that 1) when the homology of the target dataset is fixed, the complexity of the PSSM will increase as the dataset size increases and, 2) when the target dataset size is fixed, the complexity of the PSSM will increase as the homology decreases. We suppose that the reason why the entropy rises as the homology reduces is that from a low-redundant target dataset, the PSI-BLAST can retrieve a more divergent set of homologs than from a highly-redundant dataset. According to the algorithm of PSSM, the more divergent the retrieved sequences, the more complicated the position propensity matrix will be constructed [21,30]. Hence, a PSSM of divergent homologs will produce high entropy. Since a higher entropy stands for more encoded information, the amount of information in the SSP feature set (the PSSM) may increase and eventually facilitate machine learning and predictions, probably explaining why the accuracy is improved as the homology of the target dataset decreases.
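The link between the concentration of a distribution and its Shannon entropy can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the exact computation used in this study: it assumes each PSSM position is represented by a list of 20 amino acid probabilities summing to 1, uses base-2 logarithms, and averages the per-position entropy over the sequence. The function names are hypothetical.

```python
import math


def column_entropy(probabilities):
    """Shannon entropy (in bits) of the amino acid probabilities at one position.

    Zero probabilities contribute nothing, following the usual convention
    that p * log(p) tends to 0 as p tends to 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def mean_pssm_entropy(pssm):
    """Average per-position entropy over all positions of a probability matrix."""
    return sum(column_entropy(row) for row in pssm) / len(pssm)


# A "simple" position: all probability on one amino acid (low entropy).
concentrated = [1.0] + [0.0] * 19
# A "complicated" position: probability spread evenly over 20 amino acids.
dispersed = [0.05] * 20

print(column_entropy(concentrated))
print(column_entropy(dispersed))
print(mean_pssm_entropy([concentrated, dispersed]))
```

Under these assumptions, a position dominated by a single residue contributes no entropy, while an evenly dispersed position reaches the maximum of log2(20) bits, which matches the intuition that PSSMs built from more divergent homologs score higher.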
Fig 7

The average information entropy and Q3 of the assessed secondary structure prediction methods, as the size of the target dataset reduces at a fixed homology level.

The horizontal axis indicates the size of the PSSM target datasets (M: million, K: thousand). The color codes of entropy and Q3 are the same as Fig 4. The sequence homology of the target datasets, e.g., randomly-sampled subsets of the UniRef90-2015, were all the same. As the homology was fixed, the curves of the information entropy of PSSM and the accuracy of SSP showed very similar trends. The correlation coefficient between these two variables was 0.997.

In most experiments, we applied size and/or homology reduction to the PSSM target dataset, except for the one on the NCBI NR dataset (i.e., the NrNCBI100-2020). The fact that the accuracy obtained with this vast target dataset of high homology was much lower than the accuracy obtained with UniRef90-2015 or UniRef90-2018 could be well reflected by the entropy. The average information entropy of PSSM of the TS115 query set produced with NrNCBI100-2020 was 1.836, much smaller than that produced with UniRef90-2015 (2.516) or UniRef90-2018 (2.502). Although the information entropy seems promising to explain how the quality of a PSSM influences the accuracy of SSP, the entropy is simultaneously affected by the size and homology of the target dataset, and this complicated situation has made the correlation between the entropy and accuracy not always strong (Fig 4). In this work, we proposed a feasible way to assess the quality and SSP performance of a PSSM. Nevertheless, dissecting the detailed mechanism underlying the relationship between information entropy and SSP accuracy remains challenging. To our knowledge, there have been similar findings reported previously. In the study by Wang et al. [23], a measure termed Neff was defined to quantify the quality of a PSSM, and this measure also positively correlated with SSP accuracy. Judging from the formula provided in [23], the Neff was, in fact, a derivative of the Shannon information entropy.

Future works

We have long been interested in computer-aided protein engineering and in developing novel bioengineering techniques by applying protein structural phenomena like circular permutation and three-dimensional domain swapping. Previously we have developed several structure-based algorithms to study these phenomena [37-41]. However, most known proteins have no structural data yet, but only amino acid sequences. To strengthen our bioengineering platform, we have been developing sequence-based algorithms to identify and analyze these phenomena and found secondary structure prediction a great help. Fortunately, convenient SSP algorithms have been available, inclusive of the excellent ones utilized in this study [23-28]. Thanks to these algorithms, our research could move forward. Nevertheless, testing our sequence-based bioengineering platform involved a vast number of proteins, and accurate SSP predictors might take an hour to process just one (Table 1). As we tried to find solutions to enhance efficiency, we formed hypotheses, performed experiments, and eventually figured out a way to reduce the time cost significantly while preserving high accuracy. In the near future, we will apply the proposed strategy and state-of-the-art SSP methods to the sequence-based prediction of the circular permutation sites for proteins, the sequence-based detection of 3D domain swapping, and the enhancement of an SSP-driven template search algorithm for protein structure modelling [42] that can be applied to predict the structure of circularly-permuted or domain-swapped proteins.

The database of homology reduced datasets of the UniRef

To our knowledge, there are not yet web resources providing UniRef non-redundant sets with sequence identity <50%. However, according to our experimental results, the recommended identity of the PSSM target dataset is 25%. Since most homology clustering software applicable to massive datasets (like the CD-HIT [43]) does not support identities lower than 40%, and most software capable of making non-redundant sets of low identities (such as the 32-bit USEARCH [44]) cannot process massive datasets, it may not be easy to create a homology-reduced UniRef dataset of 25% identity. Although the homology reduction procedure we described in Materials and Methods is applicable, without a machine cluster supported by an efficient distributed computation system, it may take weeks to create a 25% identity NR set from the source UniRef90. To assist researchers in applying the proposed strategy, we have established a web-based database providing NR sets of the UniRef with 25% as the lowest identity level. All the experimental datasets of this study are provided as well. This database will be updated semiannually to keep up with the growth of UniRef and is available at http://10.life.nctu.edu.tw/UniRefNR.

Conclusions

Based on the observations that the improvements in SSP accuracy are limited regardless of the rapid growth of protein sequences in recent years, and that the amount of zero probability of a PSSM would influence SSP accuracy, we hypothesized that 1) the number of target sequences might have a greater effect on the time cost than on the accuracy of SSP, and 2) the homology of target sequences might affect the quality of generated PSSM as well as the SSP accuracy. Accordingly, it was expected that 1) size reduction of the target dataset would substantially cut down the time cost without much degrading the accuracy, and 2) homology reduction of the target dataset would increase the complexity of the PSSM and improve SSP accuracy. Experimental data of size/homology reductions of the target dataset agreed with the expected results and thus supported our hypotheses. We also discovered that the Shannon information entropy could measure the complexity of a PSSM and might help explain the accuracy it produces. Based on these findings, we proposed two strategies to speed up SSP without sacrificing accuracy or even with enhancements: 1) a homology reduction of the target dataset accompanied by a size reduction to millions of proteins, or 2) merely a homology reduction of the target dataset. Tested with the UniRef of 2018, the first strategy reduced the average time cost of state-of-the-art SSP algorithms from 40.6 to 1.9 min with a <0.7% decrease in Q3 or Q8 accuracy. As for the second strategy, the time cost was reduced to 48.4% while the accuracy was increased up to 0.8%. This study proves that SSP applications do not need to keep using the huge UniRef90 target dataset, which is exponentially growing and extremely challenging the computing power and storage capacity of researchers. For large-scale post-genomics applications of SSP, using the proposed strategy not only will save much time but may increase the reliability of data because of the improvement in SSP accuracy.

Materials and methods

Experimental environments

This study was carried out using three server machines. All experiments were performed on one machine equipped with two hyperthreading 6-core (Intel Xeon) 3.33 GHz processors and 166 GB memory. The other two possessed two hyperthreading 4-core (Intel Xeon) 2.27 GHz processors and four 12-core (AMD Opteron) 2.20 GHz processors, respectively. They were all clustered with a distributed computation system we developed [41,45] to speed up the production of homology-reduced UniRef datasets and sustain the periodic updating of the database constructed in this work.

Experimental datasets

For assessing the performance of secondary structure prediction methods, a set of query sequences and a target dataset for generating PSSM are required. The state-of-the-art SSP methods assessed in this study were all developed before 2016, meaning that their predictive models were all trained with proteins released in the PDB no later than Dec. 2015. Hence, an ideal query dataset for evaluating prediction accuracy should be composed of proteins released after Jan. 1st 2016. In a review by Dr. Yaoqi Zhou [17], an independent test query dataset TS115 (115 proteins) composed of proteins of PDB 2016 had been established and was, therefore, a reasonable choice for this study. According to [17], the sequence identity between TS115 and the protein structures released in PDB before 2016 was ≤30%. Besides, the CASP12 (46 proteins) and CASP13 (43 proteins) datasets obtained from the 12th and 13th biennial meetings of the Critical Assessment of Structure Prediction techniques [46], which also comprised novel protein structures determined after Jan. 2016, were used as the query sets. As for the source of target sequences, the UniRef90 of the year 2015 (UniRef90-2015; 38.2 million proteins) and 2018 (UniRef90-2018; 87.3 million proteins) established by the UniProt [47], and the non-redundant protein dataset prepared by the NCBI in 2020 [36] (NrNCBI100-2020; 257.1 million proteins), were utilized.

Applied secondary structure prediction algorithms

Several SSP algorithms were applied to perform the experiments. The standalone programs of them were all released before 2016, including three-state algorithms: Psipred (v3.3) [22], SpineX (v2.0) [24], Scorpion (v1.0) [26], and Spider2 (v2.0) [27], and eight-state algorithms: RaptorX (v1.0) [23], SSpro8 (v5.2) [25], and DeepCNF (v1.02) [28]. The original packages of these programs recruited different versions of PSI-BLAST as the PSSM generator. In order to make fair assessments, we modified their scripts to uniformly use the psiblast program of NCBI blast 2.6.0 [21]. For each algorithm, the parameters of PSI-BLAST were set according to their original scripts, except for the SSpro8 (see Results). The parameter settings of these algorithms are available in S1 Table.

Reduction of target dataset size by random sampling

Size-reduced target datasets were generated based on the UniRef90 by random sampling. Because an experiment was repeated 10 times for each test group, when the size of a shrunk target dataset was smaller than 1/10 of the UniRef90, the sampling could be done perfectly without replacement, that is, the 10 subsets would not share any common entry. When the size was larger than 1/10, the inter-subset replacement was inevitable such that some entries might be included in more than one subset. We had tested random sampling with or without inter-subset replacements, and the results showed no statistically significant difference. Therefore, the randomly-sampled target datasets used in this study were prepared by allowing inter-subset replacements.
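The two sampling modes described above can be sketched as follows. This is a hedged illustration, not the authors' script: the function name `sample_subsets`, its parameters, and the use of Python's `random` module are assumptions, and the study ultimately allowed inter-subset replacement for all sampled datasets.

```python
import random


def sample_subsets(entries, subset_size, repeats=10, seed=0):
    """Draw `repeats` random subsets of `subset_size` entries each.

    If the subsets together fit into the source dataset, they are drawn
    without inter-subset replacement (no entry is shared between subsets).
    Otherwise each subset is sampled independently, so entries may recur
    across subsets, although never within a single subset.
    """
    rng = random.Random(seed)
    if subset_size * repeats <= len(entries):
        pool = rng.sample(entries, subset_size * repeats)
        return [pool[i * subset_size:(i + 1) * subset_size] for i in range(repeats)]
    return [rng.sample(entries, subset_size) for _ in range(repeats)]


# Toy stand-in for sequence identifiers of a source dataset.
source = list(range(100))

disjoint = sample_subsets(source, 10)     # 10 x 10 entries, no sharing possible
overlapping = sample_subsets(source, 50)  # 10 x 50 entries, sharing is inevitable

print(len({e for subset in disjoint for e in subset}))
print(len({e for subset in overlapping for e in subset}))
```

The 1/10 threshold mentioned in the text corresponds to the `subset_size * repeats <= len(entries)` condition, given ten repeats per test group.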

Reduction of target dataset homology

The UniRef only provided three sequence identity non-redundant (NR) datasets, the UniRef100, UniRef90, and UniRef50. To finely test how the homology reduction would influence the performance of SSP algorithms, we had to make NR sets ranging from 90%, 80% to 25% identities. For homology levels ≥40% identity, the CD-HIT [43] was applied. Because CD-HIT did not support lower identities, for homology levels <40% identity, we changed to the USEARCH [44]. However, the 32-bit USEARCH had a limit of 4 GB memory usage, and the 40% identity NR set generated by CD-HIT was still too huge to apply the USEARCH directly. We hence designed the following procedure (illustrated in Fig 8) with USEARCH and BLAST as the homology reduction engines to generate low identity NR sets from a source dataset containing a vast number of sequences,
Fig 8

The procedure of sequence homology reduction of the target dataset.

The sequence identity non-redundant subsets of the UniRef90 were created using this procedure, which first divides the original dataset into subsets, then performs a round of intra-subset homology reduction for each subset, followed by iterative rounds of inter-subset homology reduction for every pair of subsets. The sequence homology reduction was made using USEARCH and BLAST.

Sort the sequences of the input dataset by the length in descending order. Divide the sorted sequences into subsets, each containing m proteins. Let S1, S2 …, S denote the produced subsets, where N is the number of subsets. Intra-subset homology reduction. Let h denote the homology level of the final output dataset in percent. For each subset, apply USEARCH to generate its h % sequence identity NR set and replace the original subset by this NR set. After all the subsets are replaced by the NR sets, within every subset, any two sequences will share < h % sequence identity. Inter-subset homology reduction. Let x = 1 and take subset S to be the head set, or, the invariable set. From the subsets other than the head set, choose S, where y ≤ N, to be the body set, or, the variable set. Merge the head set and body set, run USEARCH to screen out homologs with identities ≥ h %. Reassemble the remaining sequences back into the head and body sets. Let S' and S' denote the reduced version of S and S, respectively. Let q denote a sequence removed from S by USEARCH, that is, q ∈ S−S'. Using BLAST, take q as the query sequence, find its homologous sequences with identities ≥ h % from S', and then eliminate those sequences from S'. Repeat this step until the homologs of all q are eliminated from S'. Let S be replaced by the last S'. Now the remaining sequences in S' all share < h % identities with the sequences in S. Repeat steps (2) to (5) using another subset to be the body set until no more subset can be applied. Then, push the sequences of S into the collection of NR sequences and delete S. Repeat steps (1) to (6) by setting x = x + 1 until no subset is available to be the body set (i.e., x = N). Push the final head set S into the collection of NR sequences. Save the collection of NR sequences as the final output dataset, in which any two proteins share < h % sequence identity. 
In this procedure, the head set is also termed the invariable set because it remains unchanged during the inter-subset homology reduction step; homologous sequences are removed only from the body sets, which are thus termed the variable sets. The size m of the divided subsets is limited by the memory usage of USEARCH; in this work, it was set to 100,000 proteins.
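The control flow of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `identity` function is a toy ungapped comparison standing in for the alignment-based identities reported by USEARCH and BLAST, and the inter-subset step compares body sequences with head sequences directly instead of the merge-then-BLAST sequence of steps (3) and (4). The function names are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Toy percent identity (ungapped, position by position).

    Assumption: a stand-in for the alignment-based identity that USEARCH
    and BLAST would report; it is symmetric but not biologically meaningful.
    """
    matches = sum(ca == cb for ca, cb in zip(a, b))
    return 100.0 * matches / max(len(a), len(b), 1)


def reduce_within(seqs, h):
    """Intra-subset reduction: keep a sequence only if it shares
    < h % identity with every sequence already kept."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < h for k in kept):
            kept.append(s)
    return kept


def reduce_dataset(seqs, h, m):
    """Return a set of sequences in which any two share < h % identity."""
    # Sort by length (descending) and divide into subsets of m sequences.
    ordered = sorted(seqs, key=len, reverse=True)
    subsets = [ordered[i:i + m] for i in range(0, len(ordered), m)]

    # Intra-subset homology reduction.
    subsets = [reduce_within(s, h) for s in subsets]

    # Inter-subset homology reduction: the head set stays unchanged,
    # homologs are removed from every later (body) subset.
    nr = []
    for x, head in enumerate(subsets):
        for y in range(x + 1, len(subsets)):
            subsets[y] = [b for b in subsets[y]
                          if all(identity(b, a) < h for a in head)]
        nr.extend(head)  # push the finished head set into the NR collection
    return nr
```

With the toy identity, `reduce_dataset(["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"], h=90, m=2)` keeps the first and third sequences, because the second shares 90 % identity with the first. Processing subsets pairwise keeps the number of sequences held at once bounded by roughly two subsets, which is the reason the text gives for dividing the dataset.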

The procedure of sequence homology reduction of the target dataset.


Statistical analysis

The tests performed with random sampling in this study were all repeated ten times. For each test group, the time cost and accuracy measures were averaged, and the sample standard deviation of those measures was calculated. To check whether the observed difference between two groups was statistically significant, several tests were made to compute the p-value. First, the Shapiro-Wilk test was applied to check the normality of the measure values of each group. Next, if the normal distribution of the values was verified, the F-test was performed to determine the equality of variances of the two groups. Finally, if the F-test indicated that the two groups came from populations with equal variance, we used Student’s t-test to compute the p-value; otherwise, Welch’s t-test was applied.
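The final step of this workflow, choosing between Student's and Welch's t-test, can be illustrated with the two test statistics written out in plain Python. This is a sketch, not the authors' code: the function names are hypothetical, the `equal_variance` flag is assumed to come from a preceding F-test, and the conversion of the statistic to a p-value (which needs the t distribution from a statistics package such as SciPy) is omitted.

```python
from math import sqrt
from statistics import mean, variance  # variance() is the sample variance


def student_t(a, b):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def compare_groups(a, b, equal_variance):
    """Pick the test as described in the text: Student if the variances
    were judged equal (assumed to be decided by an F-test), else Welch."""
    return student_t(a, b) if equal_variance else welch_t(a, b)
```

When the two groups have the same size and the same sample variance, both functions return the same statistic and degrees of freedom; they diverge as the variances or group sizes become unequal.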

Accuracy measures

The Q and SOV measures

In addition to the traditional Q3 and Q8 accuracy, we also evaluated SSP methods by the SOV (segment overlap) measure [48,49]. The Q accuracy is defined as the number of correctly predicted residues divided by the number of predicted residues. Q3 and Q8 differ only in the type of secondary structural element (SSE) codes applied in the prediction, i.e., three-state or eight-state codes. The SOV is a measure for evaluating the accuracy of SSP based on secondary structure segments instead of residues. It is generally regarded as a more critical way of assessment than the conventional Q because it captures the overall quality of SSP for a protein and reduces noise from individual residues [48-50]. Previous works calculated the SOV based on the three-state SSE, i.e., the SOV3. In this study, the eight-state SOV (SOV8) was also calculated.
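As a small illustration of the Q definition, the sketch below computes Q8 directly on eight-state strings and Q3 after reducing them to three states. The 8-to-3 mapping shown is one common convention and is an assumption here, since this section does not state which reduction the authors used; the function names are hypothetical.

```python
# Assumed 8-to-3 reduction (a common convention, not taken from the paper):
# helices H/G/I -> H, strands E/B -> E, everything else -> C.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C", "-": "C",
}


def to_three_state(ss8: str) -> str:
    """Reduce an eight-state SSE string to three states."""
    return "".join(EIGHT_TO_THREE[c] for c in ss8)


def q_accuracy(predicted: str, observed: str) -> float:
    """Correctly predicted residues divided by predicted residues."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings differ in length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(predicted)


observed = "HHHGEEBTSC"
predicted = "HHHHEEETTC"
q8 = q_accuracy(predicted, observed)
q3 = q_accuracy(to_three_state(predicted), to_three_state(observed))
```

In this made-up example three residues differ at the eight-state level (G/H, B/E, and S/T), so Q8 is 0.7, while all three mismatches disappear after the reduction and Q3 is 1.0.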

The macro-average, micro-average, and weighted average of accuracy measures

In general, multiple query proteins are used to assess an SSP algorithm, so the presented accuracy value is usually an average. In most previous SSP studies, an accuracy value was computed by averaging the accuracies of all individual query proteins using the classic arithmetic mean equation sum/n, where sum is the summation of accuracy from all proteins and n is the number of proteins. However, this classic average weighs every protein in the query dataset equally, which is questionable because query proteins may differ greatly in size. For example, the size of the TS115 proteins ranges from 43 to 1,085 residues. In this study, the Q3 and Q8 were computed based on residues instead of proteins, i.e., the total number of correctly predicted residues from all query proteins divided by the total number of residues from all query proteins. The SOV3 and SOV8 measures could not be computed based on residues and were thus averaged over proteins with protein size as the weight, that is,

$$\overline{SOV}_w = \frac{\sum_{i=1}^{n} size_i \times SOV_i}{\sum_{i=1}^{n} size_i}$$

where $\overline{SOV}_w$ represents the weighted average of SOV, n denotes the number of query proteins, and size_i and SOV_i stand for the size (number of residues) and SOV value of protein i, respectively. Computing the averaged accuracy based on proteins and residues is analogous to computing the macro-average and micro-average of the accuracy values, respectively. Both the micro-average Q3/Q8 and the size-weighted average SOV3/SOV8 prevent underestimating the influence of large proteins or overestimating that of small ones, and hence they reflect the actual accuracy of SSP methods more faithfully.
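The difference between the averages is easiest to see with numbers. The sketch below uses two made-up proteins (sizes and scores are illustrative, not data from the study) and hypothetical function names.

```python
def macro_average(values):
    """Classic arithmetic mean over proteins (sum / n)."""
    return sum(values) / len(values)


def micro_average_q(correct_counts, sizes):
    """Residue-based Q: all correct residues over all residues."""
    return sum(correct_counts) / sum(sizes)


def size_weighted_average(values, sizes):
    """Per-protein values averaged with protein size as the weight."""
    return sum(s * v for s, v in zip(sizes, values)) / sum(sizes)


# Two hypothetical query proteins: a small one predicted poorly
# and a large one predicted well.
sizes = [50, 950]
correct = [25, 855]
per_protein_q = [c / s for c, s in zip(correct, sizes)]  # [0.5, 0.9]

macro = macro_average(per_protein_q)                     # 0.70
micro = micro_average_q(correct, sizes)                  # 0.88
weighted = size_weighted_average(per_protein_q, sizes)   # 0.88
```

For Q, the micro-average and the size-weighted average of per-protein values coincide, which is why the size-weighted SOV can be read as the segment-based counterpart of the residue-based Q. The macro-average (0.70) gives the 50-residue protein the same influence as the 950-residue one.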

Computation of the information entropy of a PSSM

The information entropy proposed by Shannon measures the amount of information in a variable. It is also known as the disorder or uncertainty of a set of data [35]. The Shannon entropy S in the case of a multi-value or multi-state variable is given by the following formula,

$$S = -\sum_{c} p_c \log_2 p_c \qquad (10)$$

where c stands for a specific value or state of the variable, and p_c is the observed probability of c in the entire probability distribution of the variable. A traditional PSSM for a residue position contains 20 values, each assigned to an amino acid. Based on the multiple alignment between the query sequence and a set of identified homologous sequences, these values indicate how likely it is in evolution that the residue position of interest is substituted with each of the 20 amino acids. For a highly conserved residue position, the probability distribution of the 20 amino acids is usually simple or concentrated; conversely, for an evolutionarily diverse position, the distribution is complex and dispersed. Since the PSSM is by nature the probability distribution of a 20-state variable, we supposed that the Shannon entropy could be applied to quantify the complexity of a PSSM. Because the PSSM output of PSI-BLAST readily provides the observed probabilities in percentage for each residue, the information entropy could be directly computed using Eq (10).
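A minimal sketch of this computation for one residue position is given below, assuming the 20 observed percentages have already been parsed from the PSI-BLAST output into a list. Two details are assumptions rather than statements from the text: the row is renormalized by its own sum (rounded percentages need not add up to exactly 100), and the per-protein value is taken as the plain mean over residue positions. The function names are hypothetical.

```python
from math import log2


def pssm_row_entropy(percentages):
    """Shannon entropy (in bits) of one PSSM row of observed percentages.

    Terms with zero probability contribute nothing (p * log2 p -> 0).
    """
    total = sum(percentages)
    if total == 0:
        return 0.0  # assumption: a row without observations carries no information
    probs = [v / total for v in percentages if v > 0]
    return -sum(p * log2(p) for p in probs)


def pssm_mean_entropy(rows):
    """Mean row entropy over all residue positions (assumed aggregation)."""
    return sum(pssm_row_entropy(r) for r in rows) / len(rows)


conserved = [100] + [0] * 19   # one amino acid only
diverse = [5] * 20             # all 20 amino acids equally likely
```

A fully conserved position gives an entropy of 0 bits and a uniform position gives log2(20), about 4.32 bits, which are the two extremes described in the paragraph above.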

Example of PSSMs generated from homologs with high or low sequence homology. (PDF)

The time cost and Q3 accuracy of the assessed SSP methods as the length of target sequences increases. (XLSX)

Detailed results of the size reduction of the PSSM target dataset on three independent query sets. (TIF)

The time cost and accuracy of individual SSP methods assessed as the sequence homology of the target dataset reduces. (TIF)

The average time cost and accuracy of the assessed SSP methods on the CASP query sets as the sequence homology of the target dataset reduces. (TIF)

The average information entropy and Q3 accuracy of the assessed SSP methods on the CASP sets as the sequence homology of the target dataset reduces. (TIF)

The time cost and accuracy of individual SSP methods assessed as the homology of the target dataset reduces with a fixed dataset size. (TIF)

The average time cost and accuracy of the assessed SSP methods on the CASP query sets, as the sequence homology of the target dataset reduces with a fixed dataset size. (TIF)

The average information entropy and Q3 of the assessed SSP methods on the CASP query sets, as the homology of the target dataset reduces with a fixed dataset size. (TIF)

The average information entropy and Q3 of the assessed SSP methods on the CASP query sets, as the size of the target dataset reduces at a fixed homology level. (TIF)

PSI-BLAST settings of the assessed SSP methods. (XLSX)

The average and individual performance data of state-of-the-art SSP methods assessed with size-reduced PSSM target datasets. (XLSX)

The average and individual performance data of state-of-the-art SSP methods assessed with homology-reduced PSSM target datasets. (XLSX)

The average and individual performance data of state-of-the-art SSP methods assessed with size- and homology-reduced PSSM target datasets. (XLSX)

The average and individual performance data of state-of-the-art SSP methods assessed with the UniRef sequences of 2018. (XLSX)

Results of curve fitting for the average time cost and accuracy of the SSP methods assessed with target datasets of decreasing homology. (XLSX)

The time cost and accuracy of state-of-the-art SSP methods on the NrNCBI100-2020 target dataset. (XLSX)

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

6 Mar 2020

PONE-D-19-35005
A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy
PLOS ONE

Dear Prof. Lo,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Apr 20 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.
Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,
M. Sohel Rahman, Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (if provided):

Please attend to all the comments of the reviewers and carefully revise the manuscript. If you decide not to follow some comments, you MUST provide an appropriate rebuttal.

Journal Requirements:

When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: N/A

3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: No

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this paper, the authors have managed to prove their proposed hypothesis about the reduction of prediction time of secondary structure prediction. The reduction of target data set size as well as homology reduction of the target sequence have managed to cut the prediction time by a significant margin. However, the following points should be addressed for further enhancement of the article:

1) The hypothesis was tested entirely on the Uniref90 data set, hence the equations (1), (2) and (3) of Tc, Q3 and Q8 respectively are based on this data set. But are these equations universal?
Will they always follow a linear equation or other higher degree polynomials? Such questions might be answered by validating the hypothesis on another recent data set, for example CASP12 or CASP13.

2) The average information content or entropy of a PSSM for a particular target was measured by Shannon's equation which uses base 2 because of its original implementation of measuring information in bit stream. Although we can convert between any base, the reasons for using base "e" was not cleared in manuscript.

3) Does the impact of reduction of target data size on accuracy depend on the length of the target sequence? An analysis based on different lengths of targets can be helpful here.

Overall the paper is written in well structured English and organized in sequential step by step points proving the results of the hypothesis.

Reviewer #2: This paper presents an interesting piece of work where authors described two simple strategies to reduce the time cost of a classical problem in the field of sequence-based prediction of structural properties of proteins, namely the secondary structure prediction (SSP).

1. The authors highlighted that computation of PSI-BLAST is the major time-consuming element of the SSP due to its prerequisite to search through a target database, such as Uniref90. As a potential strategy to reduce the time cost of SSP, the authors then proposed down sampling of the target dataset to search through for PSSM generations. Using seven SSP algorithms, authors showed that the reduction of target dataset has greater effect in decreasing the time-cost (3355 fold) with maximum of 14.1% compromise in accuracy.

a. Do all the seven predictors recruited here to benchmark use UniRef90 for the blast search during PSSM computation?

b. Can authors comment on the utility of NCBI Non-redundant database available at ftp://ftp.ncbi.nlm.nih.gov/blast/db/?

c. There are SSP software that use NCBI Non-redundant database for PSSM computation.
How do the time-costs required for PSSM computation using UniRef90 and NCBI NR databases compare?

d. On which dataset the predictors were run to generate Fig. 1?

2. The authors discussed the effects of sequence redundancy of the reference dataset (here, UniRef90), which I appreciate.

a. However, I would again like to see a comparison between using UniRef NR datasets and NCBI NR dataset.

b. The results section of the paper has statements, like “use of 25% identity NR datasets ‘truly’ outperformed the 80% identity dataset”. I recommend stating the actual percentage increase in accuracy in these cases.

c. As the improvement in accuracy by homology reduction did not reach a plateau, what was the rationale for stopping at 25%? Would be interesting to further reduce and check the improvement in accuracy.

d. I recommend benchmarking the accuracies using an independent set of proteins and predicting the SS using UniRef NR90 to NR25.

3. I appreciate the authors’ effort to make the non-redundant UniRef datasets publicly accessible to foster downstream utility.

4. The paper needs a thorough proof-read for English, and especially, to eliminate redundancies. The current paper is unnecessarily long with many repetitive segments.

5. All figures are of very poor quality. Figures of appropriate resolutions are highly recommended.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site.
Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

28 Apr 2020

(A Word file of our responses to Reviewers' Comments has been uploaded. Although we make a plain text version here following the instruction of the submission system of PLOS ONE, we would like to suggest viewing the Word version.)

Dear Professors and Editor,

We would like to thank the anonymous reviewers for their careful reading and detailed comments, which have greatly enhanced this article. We are pleased that you find sufficient merit in the work as to ask for an appropriately revised manuscript. The new version of our manuscript has been modified according to referees' comments, which are also answered as follows.

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes
Reviewer #2: Yes

Response: We would like to thank the reviewers for their positive comments.

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: N/A

Response: Indeed, complicated statistical analyses are not required to conduct this research.
Nevertheless, because we have done one thing very different from most previous SSP studies, i.e., multiple repeats of the experiments by random sampling, we were able to compute the standard deviation of the performance measures and the p-value between test groups. In the original draft, we had described how we computed the p-value in the subsection "Multiple repeats of experiments and statistical analysis." Considering that this title was unfocused, we have changed it to "Statistical analysis" (Page 42) following the Submission Guidelines of PLOS ONE. We hope that this modification will make it easier for readers to find out how we performed the significance tests.

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes
Reviewer #2: Yes

Response: We thank the reviewers for their time and patience in reviewing our numerous supporting data files.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes
Reviewer #2: No

Response: We thank the reviewers for the comments. Please see below for our response to the English writing of the revised manuscript.

5. Review Comments to the Author

Reviewer #1: In this paper, the authors have managed to prove their proposed hypothesis about the reduction of prediction time of secondary structure prediction. The reduction of target data set size as well as homology reduction of the target sequence have managed to cut the prediction time by a significant margin. However, the following points should be addressed for further enhancement of the article:

1) The hypothesis was tested entirely on the Uniref90 data set, hence the equations (1), (2) and (3) of Tc, Q3 and Q8 respectively are based on this data set. But are these equations universal? Will they always follow a linear equation or other higher degree polynomials?
Such questions might be answered by validating the hypothesis on another recent data set, for example CASP12 or CASP13.

Response: We thank the reviewer for the positive feedback and constructive comments, which greatly help us improve this work. The trends revealed by the equations of TC, Q3 and Q8 may be universal, but the fitted constants may not. Those constants are expected to vary according to the datasets and hardware/software used to perform secondary structure prediction (SSP). First, we have added a note after those equations on Page 12 to make the reader aware of this. Second, to verify whether the trends of those equations are universal, we have utilized both the CASP12 and CASP13 to repeat all experiments. Comparing Eq (1) and the new Eqs (4) and (5) obtained with the CASPs, the linear relationship between the SSP time cost and the target dataset size can be confirmed. To help visualize the linear relationship, we have made a plot in the new S1 Fig showing the straight lines drawn based on the results of those independent query sets. As for the accuracy, the relationship between Q3/Q8 and the target dataset size is sigmoid, whether tested with the original query set TS115 or the CASPs (also shown in S1 Fig). We are grateful to have followed the reviewer's suggestion and used these CASP sets because all the results obtained with them well support the conclusions of this study. In the revised manuscript, we have marked the new contents about the CASPs with an orange text background color.

2) The average information content or entropy of a PSSM for a particular target was measured by Shannon's equation which uses base 2 because of its original implementation of measuring information in bit stream. Although we can convert between any base, the reasons for using base "e" was not cleared in manuscript.

Response: Thank you very much for indicating this difference.
Actually, at the last minute before our submission of the first draft, we had noticed it and thus put down that statement. We used base "e" only because it is the default of the log() function of the programming language. We have recalculated those values with base 2 and made relevant revisions, inclusive of Eq (10). Please see Figs 4, 6, and 7, as well as the new S4, S7, and S8 Figs for the updated data. Changing to base 2 did not affect the original conclusions.

3) Does the impact of reduction of target data size on accuracy depend on the length of the target sequence? An analysis based on different lengths of targets can be helpful here.

Response: We appreciate this interesting comment and have accordingly done that. Before this experiment, it could reasonably be expected that the time cost would grow as the length of the target sequences increased. As for the accuracy, because we have a recent study on the limit of SSP accuracy showing that current SSP algorithms perform better for long query proteins than for short ones, by extending this concept it might be expected that the accuracy of SSP would rise as the length of target sequences increased. Finally, the experimental results shown in S2 File agree with the expectation. The accuracy is improved as the length of target sequences increases, but the SSP time cost is also raised. Since this article is about the efficiency improvement of SSP, we find that these results do not fit its purpose well. Therefore, we have not yet integrated this experiment into the revised manuscript. However, if the reviewer feels it necessary to add this part to enhance the completeness of the report, we will be happy to do that.

Overall the paper is written in well structured English and organized in sequential step by step points proving the results of the hypothesis.

Response: Thank you for your recognition of our writing quality.
To improve the readability of this report, we have nevertheless polished the English substantially and deleted redundant statements.

Reviewer #2: This paper presents an interesting piece of work where authors described two simple strategies to reduce the time cost of a classical problem in the field of sequence-based prediction of structural properties of proteins, namely the secondary structure prediction (SSP).

1. The authors highlighted that computation of PSI-BLAST is the major time-consuming element of the SSP due to its prerequisite to search through a target database, such as Uniref90. As a potential strategy to reduce the time cost of SSP, the authors then proposed down sampling of the target dataset to search through for PSSM generations. Using seven SSP algorithms, authors showed that the reduction of target dataset has greater effect in decreasing the time-cost (3355 fold) with maximum of 14.1% compromise in accuracy.

a. Do all the seven predictors recruited here to benchmark use UniRef90 for the blast search during PSSM computation?

b. Can authors comment on the utility of NCBI Non-redundant database available at ftp://ftp.ncbi.nlm.nih.gov/blast/db/?

c. There are SSP software that use NCBI Non-redundant database for PSSM computation. How do the time-costs required for PSSM computation using UniRef90 and NCBI NR databases compare?

d. On which dataset the predictors were run to generate Fig. 1?

Response: We would like to thank the reviewer for the questions and valuable information, which provides more materials for us to test the feasibility of the proposed strategy. The questions are answered as follows.

a: Not exactly. The UniRef90 is commonly used, but other non-redundant sequence datasets are also utilized. For instance, Spider2 uses the NR90 dataset established by HHsuite, and Psipred uses the NCBI NR dataset. As for DeepCNF, SpineX, and SSpro8, the UniRef was applied. By referring to the system Dr.
Yaoqi Zhou benchmarked state-of-the-art SSP algorithms in the review paper [32], we decided to use the UniRef database. We have mentioned in the revised manuscript on Page 23, Lines 430–431 that the Psipred uses the NCBI NR set as the PSSM target dataset.

b, c: We thank the reviewer for reminding us to use the NCBI NR for evaluating the proposed strategy. The core of our strategy is homology reduction combined with a size reduction. The sequence homology level of the NCBI NR dataset we obtained from the provided FTP site was 100% identity, higher than the ones we had tested (90% to 25%). Besides, the size of the NCBI NR dataset was much larger than the UniRef90-2015 and UniRef90-2018 we used to evaluate the strategy. This NCBI NR thus becomes an excellent example to test the feasibility of the proposed strategy in the opposite direction. It is also valuable for verifying the results we obtained with the UniRef datasets. We have used the results of UniRef homology reduction to establish a model (the new Eqs (6)–(8)) to predict the time cost and SSP accuracy if the NCBI NR were applied as the PSSM target dataset. According to the model, the time cost of using NCBI NR is expected to be 5,571 sec, and the accuracy will be lower than that obtained with the UniRef90. As expected, the actual time cost is 5,756 sec, and the accuracy was lower than that of the UniRef90-2015 and UniRef90-2018 (Pages 23–25). Besides, the difference in accuracy between NCBI NR and UniRef90s can be well explained by the information entropy of PSSM (Page 32). In the revised article, we have marked the new contents about the NCBI NR with a light green text background color.

d: Thank you for pointing out this omission. In the previous manuscript, since the only query set was the TS115, we mentioned it just once in the Materials and Methods.
Because three independent query sets (TS115, CASP12, and CASP13) and three sources of target sequences (UniRef90-2015, UniRef90-2018, and the NrNCBI100-2020) are used in the revised manuscript, we have stated explicitly for each experiment what query and target sets were applied.

2. The authors discussed the effects of sequence redundancy of the reference dataset (here, UniRef90), which I appreciate.

a. However, I would again like to see a comparison between using UniRef NR datasets and NCBI NR dataset.

b. The results section of the paper has statements, like "use of 25% identity NR datasets 'truly' outperformed the 80% identity dataset". I recommend stating the actual percentage increase in accuracy in these cases.

c. As the improvement in accuracy by homology reduction did not reach a plateau, what was the rationale for stopping at 25%? Would be interesting to further reduce and check the improvement in accuracy.

d. I recommend benchmarking the accuracies using an independent set of proteins and predicting the SS using UniRef NR90 to NR25.

Response: We thank the reviewer for the suggestions. The questions are answered as follows.

a: We appreciate the reviewer's suggestion about using the NCBI NR, which has helped us examine the proposed strategy more thoroughly than we did in the previous manuscript. Please see our response to your question 1b and 1c for the comparison between UniRef and NCBI NR datasets.

b: Thank you for the constructive comment. In the previous manuscript, immediately after most summary statements we put supporting sentences. For instance, after the one you mentioned, we said: "Although the average differences in Q3 and Q8 between the 90% and 25% NR datasets were both only 0.70%, the average p-values were <1.25×10^-6 and <9.99×10^-8, respectively, …" In the revised manuscript, we have removed many unnecessary adverbs, adjectives, and summary statements (including the one we are talking about on Page 18).
Hopefully, this "redundancy reduction" has improved the quality of our manuscript.

c: In the experiment of Fig 2, we found that 5 million proteins would be the suitable target dataset size for accelerating SSP without sacrificing accuracy. In the experiments of homology reduction, the lowest sequence identity was 25% because when a lower identity was applied, the number of sequences remaining in the target dataset would be smaller than 5 million (see the new S6 Table) and cause unreasonable decreases in accuracy because of the small dataset size. We have added this explanation to the legend of Figs 2 and 3. In fact, if from the beginning a smaller size were applicable, the accuracy might keep increasing as the homology of the target dataset was reduced to lower than 25%. In one of our recent studies on SSP (submitted to PLOS ONE), tiny target sets of 10,000 proteins were prepared to dissect the influence of dataset sequence redundancy on the evaluation of SSP methods. In that study, we discovered that the accuracy of SSP would keep increasing down to a 10% identity homology reduction of the target set.

d: We would like to thank the reviewer for this great quality-improving suggestion. We have applied the CASP12 and CASP13 independent sets to repeat all experiments, and the results agree well with our original conclusions. In the revised article, we have marked the new contents about the CASP tests with an orange text background color.

3. I appreciate the authors' effort to make the non-redundant UniRef datasets publicly accessible to foster downstream utility.

Response: Thank you for the kind comment and your recognition of our effort.

4. The paper needs a thorough proof-read for English, and especially, to eliminate redundancies. The current paper is unnecessarily long with many repetitive segments.

Response: We would like to thank the reviewer for pointing out these shortcomings. As non-native English speakers, we spent a lot of time writing.
To improve the quality of this article, we have re-examined it and corrected several grammatical and spelling errors. As for the redundancies, they arose partly because we took PLOS ONE's suggestion about figure legends too seriously: "allow readers to understand it (the figure) without referring to the text." We have removed many repetitive statements from the figure legends and simplified the main text, hoping to improve the reading experience. Microsoft Word's change tracking was retained in the uploaded "Revised Manuscript with Track Changes" file. We hope these changes will meet your expectations.

5. All figures are of very poor quality. Figures of appropriate resolutions are highly recommended.

Response: We thank the reviewer for pointing this out. The figures embedded in the manuscript PDF file created by the PLOS ONE submission system were all resampled. The original high-resolution figures can be obtained through the "Click here to access/download" hyperlink at the top of each figure page. The original figures were prepared at the resolution required by PLOS ONE, with a width of 2,250 pixels. If the reader views a figure at 100% scale, all the points, curves, and text will be very sharp. However, we put eight panels in some figures, and the points and curves in them are indeed too small for readers whose viewing software does not make zooming convenient. We have modified those figures (including the supporting figures) by magnifying the points and text.

6. Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No

Response: We would like to express again our sincere thanks to the anonymous reviewers for helping us improve this work.

Submitted filename: Response to Reviewers.docx

10 Jun 2020

A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy

PONE-D-19-35005R1

Dear Dr. Lo,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
M. Sohel Rahman, Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)
Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)
Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: (No Response)
Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)
Reviewer #2: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)
Reviewer #2: All comments have been addressed. In particular, the newly added results reporting on the evaluation of the proposed technique on the NR dataset in the revised paper are compelling.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

19 Jun 2020

PONE-D-19-35005R1

A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy

Dear Dr. Lo:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. M. Sohel Rahman
Academic Editor
PLOS ONE
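
The rationale given in the authors' response to comment 2c (use the lowest sequence identity cutoff whose reduced target dataset still holds at least 5 million sequences) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `lowest_usable_identity`, its parameters, and the sequence counts in the example are hypothetical and are not taken from the manuscript or from S6 Table.

```python
from typing import Mapping, Optional


def lowest_usable_identity(
    sizes_by_identity: Mapping[int, int],
    min_size: int = 5_000_000,
) -> Optional[int]:
    """Return the lowest identity cutoff (in %) whose homology-reduced
    dataset still contains at least `min_size` sequences.

    Returns None when no cutoff satisfies the size floor.
    """
    # Keep only the cutoffs whose reduced dataset meets the size floor.
    usable = [
        cutoff
        for cutoff, n_sequences in sizes_by_identity.items()
        if n_sequences >= min_size
    ]
    return min(usable) if usable else None


# Made-up sequence counts for illustration only (NOT the values in S6 Table).
toy_sizes = {90: 40_000_000, 50: 15_000_000, 25: 6_000_000, 20: 3_000_000}

print(lowest_usable_identity(toy_sizes))  # 25
```

With the toy counts above, the 20% cutoff is rejected because its dataset falls below the 5 million floor, so 25% is returned. The default of 5,000,000 mirrors the dataset size the authors report as suitable in the Fig 2 experiment; any other floor can be passed through `min_size`.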