
ProteinUnet-An efficient alternative to SPIDER3-single for sequence-based prediction of protein secondary structures.

Krzysztof Kotowski¹, Tomasz Smolarczyk¹, Irena Roterman-Konieczna², Katarzyna Stapor¹.

Abstract

Predicting protein function and structure from sequence remains an unsolved problem in bioinformatics. The best-performing methods rely heavily on evolutionary information from multiple sequence alignments, which means that their accuracy deteriorates for sequences with few homologs and that, given the increasing size of sequence databases, they require long computation times. Here, a single-sequence-based prediction method called ProteinUnet is presented, leveraging a U-Net convolutional network architecture. It is compared to the SPIDER3-Single model, which is based on a long short-term memory bidirectional recurrent neural network architecture. Both methods achieve similar results for prediction of secondary structures (both three- and eight-state), half-sphere exposure, and contact number, but ProteinUnet has two times fewer parameters, 17 times shorter inference time, and can be trained 11 times faster. Moreover, ProteinUnet tends to be better for short sequences and residues with a low number of local contacts. Additionally, loss weighting is presented as an effective way of increasing accuracy for rare secondary structures.


Keywords: backbone angles estimation; deep learning; protein structure prediction; secondary structure prediction; solvent accessibility prediction

Year: 2020 PMID: 33058261 PMCID: PMC7756333 DOI: 10.1002/jcc.26432

Source DB: PubMed Journal: J Comput Chem ISSN: 0192-8651 Impact factor: 3.376

INTRODUCTION

The three-dimensional structure of a protein is determined by its amino acid sequence[ , ] and is key to its functional mechanism. Experimental determination of a structure is costly and time-consuming compared to sequence determination,[ ] and the number of known sequences is up to 1,000 times larger than the number of experimentally examined structures.[ ] This creates a need for techniques and models that computationally predict a protein structure from its primary sequence. The challenge started in 1951, when Pauling and Corey predicted helical and sheet conformations for the protein polypeptide backbone,[ , , ] and it has not been solved yet. Accurate protein structure and function prediction relies, in part, on the accuracy of secondary structure prediction, which has been studied extensively, resulting in many computational methods (e.g., see an overview[ ]). A number of researchers also concentrate on predicting structural properties of proteins such as backbone dihedral angles, leveraging this information for secondary structure prediction or calculation ([ ] based on the early/late-stage protein folding approach[ ]). Recently developed state-of-the-art methods of secondary structure prediction leverage deep neural network architectures and multiple sequence alignments (MSAs) of homologous sequences, allowing them to achieve up to 88% Q3 accuracy,[ ] especially for proteins with a large number of known homologous sequences. However, the majority of proteins have very few known homologous sequences or none at all.[ ] For such cases, prediction accuracy can deteriorate because of the limited or nonexistent evolutionary information.[ ] Moreover, as the number of known sequences grows, the computational time required for finding MSA profiles also increases, reaching multiple hours for longer sequences.
Heffernan et al.[ ] took advantage of recent advancements in deep neural networks and proposed SPIDER3-Single, a single-sequence-based model using long short-term memory (LSTM) bidirectional recurrent neural networks (BRNNs). The model can predict multiple one-dimensional (1D) structural properties with relatively high accuracy, especially for nonhomologous sequences. In this article, we leverage an alternative deep neural network architecture, U-Net,[ ] for protein structure prediction and compare our model, ProteinUnet, to SPIDER3-Single. The U-Net architecture allowed us to reduce the number of parameters in the network and significantly decrease the training and prediction time compared to SPIDER3-Single while maintaining similar performance. The rest of the article is structured as follows: the second section describes the datasets used in the analysis with a brief description of the inputs and outputs used by the models. The next section outlines both algorithms with the description of stratification, weights accounting for rare classes, and training procedures with ensembling. The following section describes the evaluation metrics for classification and regression. Finally, the last two sections present the results and conclusions.

METHODS

Datasets

In order to compare our implementation of the SPIDER3-Single model to the original one, we used the same datasets as the authors of SPIDER3-Single.[ ] The original dataset was downloaded from CullPDB[ , ] in February 2017 and split into several smaller datasets. Two of them, TR9993 and TS1199, are listed on the authors' website (https://sparks-lab.org/publication/). The training set TR9993 consists of 9,993 different chains from 9,622 proteins, and the test set TS1199 consists of 1,199 chains from 1,187 different proteins. However, 16 of these proteins are no longer available in the Protein Data Bank[ ] (checked on March 15, 2020). Additionally, 16 chains longer than 1,024 residues were removed from the training set, since this is the maximum sequence length supported by ProteinUnet. Thus, we created the subsets TR9961 (9,961 chains from 9,592 proteins) and TS1197 (1,197 chains from 1,186 proteins). The performance was also tested on 152 proteins from the CASP13 dataset.[ ]

Inputs

The input to the model for a given sequence is a one-hot encoded matrix of size 20 × L, where L is the length of the protein chain, as in the original article. No other features, such as physicochemical properties,[ ] the BLOSUM matrix,[ ] PSSM,[ ] or HHBlits[ ] profiles, were used. The idea behind the SPIDER3-Single model was to let the neural network learn all the relationships directly from the sequence. The distribution of amino acids is uneven and ranges from 9.6% for the most common, leucine, to 1.2% for the rarest, cysteine (CYS).
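As a minimal sketch of the 20 × L one-hot input described above, the snippet below encodes a chain in plain Python. The alphabetical ordering of the one-letter codes and the all-zero treatment of unknown residues are assumptions made for illustration; the source does not specify either.

```python
# The 20 standard amino acids in one-letter code.
# The ordering is an assumption; the source does not state one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a 20 x L matrix as a list of 20 rows, each of length L."""
    length = len(sequence)
    matrix = [[0] * length for _ in AMINO_ACIDS]
    for position, residue in enumerate(sequence):
        row = AA_INDEX.get(residue)
        if row is not None:  # unknown residues (e.g. 'X') stay all-zero
            matrix[row][position] = 1
    return matrix


encoded = one_hot_encode("MKV")
print(len(encoded), len(encoded[0]))  # number of rows and columns
```

Each column contains a single 1 in the row of the residue at that position, so the matrix carries no information beyond the sequence itself.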

Outputs

The model outputs can be divided into two main categories: classification and regression outputs. For classification, the model predicts the secondary structure in eight and three states. The eight states are specified by the secondary structure assignment program Define Secondary Structure of Proteins (DSSP)[ ] as follows: three helix states, 3₁₀-helix (G), alpha-helix (H), and pi-helix (I); two strand states, beta-bridge (B) and beta-strand (E); and three coil types, high curvature loop (S), beta-turn (T), and coil (C). These eight classes are also converted into a simpler three-class problem by grouping the states G, H, and I into H; B and E into E; and S, T, and C into C. Each problem has separate output nodes in the neural network, resulting in 11 classification output nodes. The distribution of output classes is uneven in our datasets. For the eight-class problem, the share of classes ranges from 1% for the rarest I and B classes to 34% for the most common H class. The distributions are very similar between the TR9961 and TS1197 datasets. The regression outputs were calculated using the Biopython package[ ] and represent accessible surface area (ASA),[ ] the angles ϕ, ψ, θ, and τ (all angles have sine and cosine outputs to remove the effect of the angle's periodicity), half sphere exposure (HSE; there are separate outputs for HSE-up and HSE-down),[ ] and contact number (CN). For details, please refer to the original study.[ ] Overall, there are 12 regression outputs.
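The reduction from eight to three states described above is a fixed lookup, which can be sketched as follows. The function name is illustrative, and the input is assumed to use the one-letter state codes exactly as listed in the text.

```python
# Grouping from the text: G, H, I -> H; B, E -> E; S, T, C -> C.
Q8_TO_Q3 = {
    "G": "H", "H": "H", "I": "H",
    "B": "E", "E": "E",
    "S": "C", "T": "C", "C": "C",
}


def q8_to_q3(q8_states):
    """Convert a string of eight-state labels into three-state labels."""
    return "".join(Q8_TO_Q3[state] for state in q8_states)


print(q8_to_q3("GHIBESTC"))
```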

Models

All the methods were implemented in an environment containing Python 3.7 and TensorFlow 2.2[ ] with Keras,[ ] accelerated by CUDA 10.1 and cuDNN 7.6. The prediction server based on our ProteinUnet is published on the CodeOcean platform (https://codeocean.com/capsule/2521196/tree/v1).

SPIDER3‐Single

SPIDER3-Single[ ] is a network containing two BRNN layers of LSTM units with 256 nodes per direction, followed by a fully connected classifier with two hidden layers of 1,024 and 512 units. LSTM units[ ] are used to learn both short-range and long-range dependencies within sequences, and the classifier is used to infer the output from these dependencies. The input of the SPIDER3-Single model is a one-hot encoded single sequence of amino acids. This is the key difference from the original SPIDER3 model,[ ] which uses additional evolutionary features such as PSSM[ ] and HHBlits[ ] profiles that are computationally expensive to obtain. Moreover, SPIDER3-Single follows the famous postulate of Anfinsen[ ] that the secondary structure of a protein is completely determined by its amino acid sequence alone. The sizes and activations of the output layers differ between the tasks. For classification, there are two one-hot encoded output layers of size 3 × L (Q3 output) and 8 × L (Q8 output) followed by softmax activations. For regression, there are four output layers of size 2 × L (sine and cosine for each of the ϕ, ψ, θ, and τ angles) and four output layers of size 1 × L (ASA, CN, HSEα-up, and HSEα-down features) followed by sigmoid activations. The values of the latter four output features were normalized to the range [0, 1] by dividing them by their maximum values over the whole training dataset (ASA: 330, CN: 131, HSEα-up: 76, HSEα-down: 79). Additionally, the loss weights for these outputs were set to 2 in order to equalize the contributions of each feature to the loss. The SPIDER3-Single network has nearly 3.2 million trainable parameters, of which two-thirds belong to the BRNN part and one-third to the classifier part.
This kind of network has proven to be very effective in secondary structure prediction,[ ] natural language processing,[ ] brain signal analysis,[ ] and time series forecasting.[ ] In the original SPIDER3-Single study, the authors presented results of the model repeatedly stacked in a process called iterative learning. Iterative learning significantly increases the training time and complexity while giving only small improvements in accuracy. For the purposes of our comparison with ProteinUnet, we decided not to use iterative learning.
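The regression targets described above can be illustrated with a short sketch: angles are represented by their sine and cosine so that, for example, 179° and −179° end up close together, and the remaining features are divided by the training-set maxima quoted in the text. The function and dictionary names are illustrative, and any further rescaling applied before the sigmoid outputs is not covered here.

```python
import math

# Maximum values over the training set, as quoted in the text.
MAX_VALUES = {"ASA": 330.0, "CN": 131.0, "HSE_up": 76.0, "HSE_down": 79.0}


def encode_angle(degrees):
    """Represent an angle by (sin, cos) to remove the jump at +/-180 degrees."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)


def decode_angle(sine, cosine):
    """Recover the angle in degrees, in the range (-180, 180]."""
    return math.degrees(math.atan2(sine, cosine))


def normalize(feature, value):
    """Scale a feature to [0, 1] by its training-set maximum."""
    return value / MAX_VALUES[feature]


s1, c1 = encode_angle(179.0)
s2, c2 = encode_angle(-179.0)
# Distance between the two encodings is small although the raw angles
# differ by 358 degrees.
print(math.hypot(s1 - s2, c1 - c2))
print(decode_angle(*encode_angle(-63.0)), normalize("ASA", 165.0))
```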

ProteinUnet

Our 1D fully convolutional ProteinUnet deep neural network consists of a series of blocks placed symmetrically as contractive and expanding paths (that can be broadly thought of as an encoder and decoder), yielding a U‐shape.[ ] It is a state‐of‐the‐art architecture in the domain of image segmentation.[ , ] The secondary structure prediction for 1D sequences is analogous to the multi‐label segmentation of 2D images, but to the best of our knowledge, U‐Net architecture has not been used previously for protein structure prediction.[ , , ] In our proposed architecture, each block in the contractive path contains three convolutional layers with zero padding and kernels of size 7 with stride 1, followed by a rectified linear unit (ReLU) activation. The first two blocks contain 64 filters per layer, and the second two contain 128 filters per layer. Each block ends with an average pooling layer with a kernel of size 2 to perform downsampling. In the expanding path, there are only two convolution layers per block. Each block is concatenated with the depth‐matched block from the contractive path, and then upsampled and passed to the next block. In this manner, high‐level features, extracted in the contractive path, propagate through higher‐resolution layers of the expanding path. It provides the local context to the global information while upsampling, increasing the precision of the output sequences. Finally, fully connected layers with 128 and 64 ReLU‐activated nodes are added as a classifier, followed by an output layer with softmax (for classification network) or sigmoid (for regression network) activations. The architecture diagram of the classification network is presented in Figure 1.
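To make the downsampling in the contractive path concrete, the following sketch tracks the feature-map shape (sequence length, number of filters) after each block. It assumes four contractive blocks with the filter counts given above and an input padded to 1,024 residues (the maximum length stated in the Datasets section); the zero-padded, stride-1 convolutions leave the length unchanged, so only the pooling changes it. This is a shape walkthrough under those assumptions, not the authors' implementation.

```python
def contracting_path_shapes(length=1024, filters=(64, 64, 128, 128), pool=2):
    """Return (length, filters) after each contractive block."""
    shapes = []
    for n_filters in filters:
        # Three zero-padded convolutions (kernel 7, stride 1) keep the length;
        # the average pooling with kernel `pool` then divides it.
        length //= pool
        shapes.append((length, n_filters))
    return shapes


for block, shape in enumerate(contracting_path_shapes(), start=1):
    print(block, shape)
```

Under these assumptions the deepest feature map covers the chain with 64 positions, each summarizing 16 input residues, which is what the expanding path later upsamples and combines with the depth-matched features.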

FIGURE 1

The architecture of the ProteinUnet secondary structure classification network. The regression network differs only in the number and activations of output layers [Color figure can be viewed at wileyonlinelibrary.com]

To decrease the number of parameters and increase the correlation between Q8 and Q3 predictions, the output layer for Q3 states is calculated from the output for Q8 (unlike in SPIDER3-Single, where the outputs for Q8 and Q3 are parallel). All the losses and metrics of ProteinUnet are the same as in SPIDER3-Single. The total number of trainable parameters of our ProteinUnet classification network is close to 1,597k, which is half the number for SPIDER3-Single. These two networks have very different hyperparameters (e.g., numbers of filters instead of hidden state dimensions), so they cannot be easily compared. Nevertheless, the training of ProteinUnet is more than 11 times faster, and the inference is over 17 times faster, on a Tesla K80 GPU with an Intel Xeon 2.3 GHz CPU and 14 GB of RAM, as presented in Table 1. Besides having two times fewer parameters, ProteinUnet, being a CNN, benefits more from cuDNN acceleration.[ ] Also, the constant size of inputs and outputs in ProteinUnet (in contrast to the varying lengths in SPIDER3-Single) makes it easier to implement and to manage GPU memory.
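The text states that the Q3 output is calculated from the Q8 output but does not say how. One simple possibility, shown here purely as an assumption and not as the authors' layer, is to sum the Q8 probabilities within each three-state group, which keeps the two predictions consistent by construction. The state ordering and names below are hypothetical.

```python
# Hypothetical ordering of the eight output nodes.
Q8_ORDER = "GHIBESTC"
# Grouping of eight states into three, as defined in the Outputs section.
Q3_GROUPS = {"H": "GHI", "E": "BE", "C": "STC"}


def q3_from_q8(q8_probabilities):
    """Aggregate eight-state probabilities into three-state probabilities."""
    by_state = dict(zip(Q8_ORDER, q8_probabilities))
    return {
        q3: sum(by_state[state] for state in members)
        for q3, members in Q3_GROUPS.items()
    }


q3 = q3_from_q8([0.10, 0.40, 0.00, 0.05, 0.25, 0.05, 0.05, 0.10])
print(q3)
```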

TABLE 1

Comparison of mean training and prediction times for SPIDER3‐Single and ProteinUnet 10‐model ensembles

	Classification		Regression
	SPIDER3‐Single	ProteinUnet	SPIDER3‐Single	ProteinUnet
Mean training time per epoch (s)	524.9 ± 1.7	42.0 ± 0.1	527.8 ± 1.7	45.9 ± 0.3
Mean prediction time per chain in TS1197 (s)	1.12 ± 0.54	0.062 ± 0.0025	1.13 ± 0.54	0.066 ± 0.0031

On the other hand, a constant input size is problematic for variable-length amino acid sequences. Thus, we decided to limit the length of supported sequences in our solution to 1,024 residues and to pad shorter sequences with zeros, masking the loss and metrics accordingly (so the zeros do not affect the results of training or validation). ProteinUnet, like any other convolutional neural network, processes input sequences as separate patches using a window with the width of the convolutional kernel. Unlike a BRNN, it is not sensitive to the order of the timesteps beyond a local scale. However, to recognize more distant patterns, many convolutional layers are stacked with pooling layers, extracting information from long chunks of the sequence. The receptive field of our ProteinUnet was calculated (using ref. [36]) to be 710 residues long, so contacts between residues farther apart than this cannot be analyzed. However, such long-range interactions are extremely rare and are present in less than 0.02% of residues in our TS1197 dataset.
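A minimal sketch of the zero padding and masking described above: labels are padded to the fixed length of 1,024 and a binary mask marks the real residues, so that a metric is computed over real positions only. The helper names and the integer label encoding are illustrative assumptions.

```python
MAX_LENGTH = 1024  # maximum supported sequence length stated in the text


def pad_with_mask(values, max_length=MAX_LENGTH):
    """Pad a per-residue list with zeros and return (padded, mask)."""
    if len(values) > max_length:
        raise ValueError("sequence longer than the supported maximum")
    n_padding = max_length - len(values)
    padded = list(values) + [0] * n_padding
    mask = [1] * len(values) + [0] * n_padding
    return padded, mask


def masked_accuracy(true_labels, predicted_labels, mask):
    """Accuracy over positions where mask == 1; padding is ignored."""
    kept = [(t, p) for t, p, m in zip(true_labels, predicted_labels, mask) if m]
    return sum(t == p for t, p in kept) / len(kept)


true_padded, mask = pad_with_mask([1, 2, 0])
predicted = [1, 0, 0] + [0] * (MAX_LENGTH - 3)
print(masked_accuracy(true_padded, predicted, mask))
```

Without the mask, the 1,021 padded positions would all count as correct and inflate the accuracy of this three-residue example from 2/3 to almost 1.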

Handling imbalanced structure states

Some secondary structure states are relatively rare (like B, G, or I, each present for less than 5% of residues in the Q8 training set), which makes the dataset heavily imbalanced. Interestingly, this issue was not addressed in any previous work.[ , , , ] Our solution uses two methods to address this problem: stratification of folds, and adjusting the Q8 loss weights to the frequency of the secondary structure states. There were nine binary stratification factors for the training set: the sequence length (shorter or longer than the mean sequence length), and one factor for each of the eight states (fewer or more occurrences than the mean number of occurrences per chain: C, 44.7; H, 77.2; E, 50.9; T, 25.3; G, 8.9; S, 18.7; I, 1.2; B, 2.5). This technique ensures that each of the 10 folds contains a similar ratio of each state. The same stratification was used for both ProteinUnet and SPIDER3-Single. In a separate section of the article, the method of loss weighting was assessed for ProteinUnet Q8 classification. The weights for the four least frequent structures (G, S, I, B) were set to increase as their fraction of occurrence r in the TR9961 dataset decreases, using the formula $\log(0.25/r)$. This should make ProteinUnet pay more attention to the rare states.
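The nine stratification factors can be combined into a single key per chain, as in the sketch below; chains sharing a key can then be spread evenly over the 10 folds. The per-state means are taken from the text, while the mean sequence length of 250 used in the example is a hypothetical placeholder, since the text does not give it.

```python
# Mean number of occurrences per chain for each state, as quoted in the text.
MEAN_COUNTS = {
    "C": 44.7, "H": 77.2, "E": 50.9, "T": 25.3,
    "G": 8.9, "S": 18.7, "I": 1.2, "B": 2.5,
}


def stratification_key(q8_labels, mean_length):
    """Build a 9-character key of binary factors for one chain."""
    factors = [len(q8_labels) > mean_length]
    factors += [q8_labels.count(state) > mean for state, mean in MEAN_COUNTS.items()]
    return "".join("1" if factor else "0" for factor in factors)


# A 90-residue chain that is mostly alpha-helix; mean_length=250 is hypothetical.
key = stratification_key("H" * 80 + "C" * 10, mean_length=250)
print(key)
```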

Training procedures and ensembling

The training dataset was divided into 10 stratified folds for cross‐validation. For a fair comparison, both architectures were trained using the same division into folds. Each of 10 models was trained using Adam optimizer[ ] with batch size 8 and initial learning rate 0.001. Early stopping condition was used when the validation loss was not improving for 5 epochs. The training lasted from 12 to 16 epochs for classification (ProteinUnet—M = 13.4, SD = 0.9; SPIDER3‐Single—M = 13.9, SD = 1.1) and from 13 to 20 epochs for regression (ProteinUnet—M = 15.7, SD = 2.2; SPIDER3‐Single—M = 15.5, SD = 1.1). After the training, the ensemble was created from all the 10 models by taking the average of their outputs, forming the final prediction on the test set. There is no information about a batch size or a learning rate in articles about SPIDER3[ ] nor SPIDER3‐Single.[ ] Due to the variable length of the input of SPIDER3‐Single, the training with a batch size of 8 was implemented in a way where all the sequences in the batch are filled with zeros up to the length of the longest sequence in the batch. The loss and metrics are masked accordingly, so these additional zeros do not affect the results of training or validation. All the predictions on the test set were performed with a batch size 1 (one‐by‐one, without zero padding).
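The ensembling step described above, averaging the outputs of the fold models, can be sketched in plain Python as follows. The nested-list layout `model_outputs[model][residue][class]` and the function names are assumptions for illustration.

```python
def ensemble_average(model_outputs):
    """Average per-residue class probabilities over all models."""
    n_models = len(model_outputs)
    return [
        [sum(class_values) / n_models for class_values in zip(*residue_outputs)]
        for residue_outputs in zip(*model_outputs)
    ]


def predicted_classes(probabilities):
    """Pick the most probable class index for every residue."""
    return [row.index(max(row)) for row in probabilities]


# Two hypothetical models, two residues, two classes.
outputs = [
    [[0.6, 0.4], [0.1, 0.9]],
    [[0.2, 0.8], [0.3, 0.7]],
]
averaged = ensemble_average(outputs)
print(averaged, predicted_classes(averaged))
```

In this toy case the first model alone would assign class 0 to the first residue, but the averaged prediction follows the more confident second model.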

Evaluation metrics

Classification

The simplest and most popular measures of protein secondary structure prediction quality are the average three-state per-residue accuracy Q3 and eight-state per-residue accuracy Q8. They give the percentage of residues for which the predicted secondary structures are correct[ , ] according to Equation (1):

$$Q_m = \frac{100}{N} \sum_{i=1}^{m} M_{ii} \qquad (1)$$

where $m$ is the number of classes, $N$ is the total number of residues, and $M_{ii}$ is the number of correctly predicted residues in state $i$. Q3 and Q8 accuracies are defined for $m = 3$ and $m = 8$, respectively.[ ] Since Q3 and Q8 are reported in almost every article, including the original SPIDER3-Single study, we use them in our comparisons as well.
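As a small illustration of the per-residue accuracy defined above, the share of residues whose predicted state matches the true state can be computed directly from two label strings. The same function serves Q3 and Q8, depending on which alphabet the labels use; the function name is illustrative.

```python
def q_accuracy(true_states, predicted_states):
    """Percentage of residues whose predicted state equals the true state."""
    if len(true_states) != len(predicted_states):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * correct / len(true_states)


# Seven of eight residues are predicted correctly.
print(q_accuracy("HHHEECCC", "HHHEECCH"))
```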

Regression

The continuous variables are split into two groups, following the methodology described by the SPIDER3-Single authors,[ ] and performance is measured differently for each group. For ASA, CN, HSEα-up, and HSEα-down, predicted values are compared to the true values using the Pearson correlation coefficient (CC), defined in Equation (2):

$$CC = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$

where $n$ is the sample size, $x_i$ and $y_i$ are the individual sample points indexed with $i$, $\bar{x}$ is the sample mean for the $x$ variable, and $\bar{y}$ is the sample mean for the $y$ variable. The performance for the ϕ, ψ, θ, and τ angles is calculated as the circular mean absolute error (MAE), which for each residue takes the smaller of $|P_i - T_i|$ and $360° - |P_i - T_i|$ to account for the periodicity of the angles, where $P_i$ is the predicted angle value and $T_i$ is the true angle value.
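Both regression metrics can be sketched with the standard library alone. The Pearson coefficient follows its usual definition, and the circular error takes the smaller of the absolute difference and 360° minus that difference, as described above. Function names are illustrative.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def circular_mae(predicted_degrees, true_degrees):
    """Mean absolute angular error in degrees, respecting periodicity."""
    total = 0.0
    for p, t in zip(predicted_degrees, true_degrees):
        difference = abs(p - t) % 360.0
        total += min(difference, 360.0 - difference)
    return total / len(predicted_degrees)


print(pearson_cc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# 179 and -179 degrees are only 2 degrees apart on the circle.
print(circular_mae([179.0], [-179.0]))
```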

RESULTS

Table 2 compares the overall results on the test sets for the original SPIDER3-Single Iteration 2 (the authors do not report results for Iteration 1), our reimplementation of SPIDER3-Single, and the newly proposed ProteinUnet. Because of the differences mentioned above, the original and reimplemented SPIDER3-Single cannot be compared directly; however, their results are at a similar level. In the direct comparison to the reimplemented SPIDER3-Single, our ProteinUnet achieved better classification accuracies but worse results for angles. However, all the differences are smaller than 2%.

TABLE 2

The comparison of performance for test sets between (a) original SPIDER3-Single Iteration 2,[ ] (b) our reimplementation of SPIDER3-Single, and (c) ProteinUnet according to fraction of residues in correctly predicted three and eight states (Q3 and Q8), Pearson CC, and MAE

	(a)	(b)	(c)
	TS1199	TS1197	TS1197
Q3	72.56%	72.56%	72.66%
Q8	60.11%	59.88%	60.06%
ASA (CC)	0.671	0.669	0.667
HSEα-up (CC)	0.612	0.608	0.602
HSEα-down (CC)	0.568	0.566	0.567
CN (CC)	0.643	0.618	0.621
ϕ (MAE)	24.5	23.5	23.7
ψ (MAE)	43.5	41.8	42.3
θ (MAE)	11.3	10.1	10.2
τ (MAE)	45.8	43.2	43.8

Abbreviations: ASA, accessible surface area; CC, correlation coefficients; HSE, half sphere exposure; MAE, mean absolute error.

Table 3 shows the mean accuracies of Q3 and Q8 predictions at the sequence level in TS1197 and CASP13, along with SDs, and p-values of the two-sided Wilcoxon signed-rank test between models. For TS1197, ProteinUnet gives better mean accuracies and lower SDs than SPIDER3-Single. The difference for Q3 is significant at p < .05, and for Q8 at p < .0001. For CASP13 dataset, ProteinUnet gives worse results for Q3 (p < .05), and very similar results for Q8 (p = .90).

TABLE 3

Performance in secondary structure prediction by ProteinUnet and SPIDER3‐Single on TS1197 and CASP13[ ] according to mean accuracy and SD at the sequence level

		TS1197			CASP13
		Mean (%)	SD (%)	p‐Value	Mean (%)	SD (%)	p‐Value
Q3	ProteinUnet	73.53	8.70	.0152	74.39	8.13	.0128
Q3	SPIDER3‐Single	73.18	9.04	.0152	75.12	7.65	.0128
Q8	ProteinUnet	61.82	10.86	<.0001	60.81	12.17	.8961
Q8	SPIDER3‐Single	61.34	11.15	<.0001	60.81	12.79	.8961


Classification

Analysis per amino acid

The analysis of classification accuracy per amino acid type is presented in Figure 2. Rare amino acids, such as CYS, histidine (HIS), and tryptophan (TRP), tend to have worse accuracy. Among the rare amino acids, only methionine has above-average accuracy. The best Q3 accuracy for both models was achieved for proline (PRO): 76.26% for ProteinUnet and 76.69% for SPIDER3-Single. The biggest Q3 difference in favor of ProteinUnet is for TRP (0.48 pp.) and in favor of SPIDER3-Single for tyrosine (0.46 pp.).

FIGURE 2

Accuracy of the secondary structure prediction (Q3 and Q8) for individual amino acids for SPIDER3-Single (red triangles) and ProteinUnet (green circles) on the TS1197 dataset. Three-letter codes are used for amino acid residues. The size of the bubble represents the frequency of the amino acid. The gray horizontal line marks the fraction of residues in correctly predicted three and eight states (Q3 and Q8) [Color figure can be viewed at wileyonlinelibrary.com]

Surprisingly, the Q8 accuracy for PRO is below average, and the best-performing amino acid for Q8 is isoleucine (ILE). Similarly, the worst Q8 accuracy was achieved for glycine (GLY), which shows above-average results for Q3. The biggest Q8 difference in favor of ProteinUnet is for CYS (0.90 pp.) and in favor of SPIDER3-Single for PRO (0.69 pp.).

Analysis per sequence length

Figure 3 presents the Q3 accuracy as a function of sequence length. The linear regression models show that ProteinUnet has a higher accuracy for shorter chains, but its accuracy decreases faster than that of SPIDER3-Single with increasing sequence length. The Q3 accuracy of ProteinUnet was below 40% for only one chain, compared to six chains for SPIDER3-Single. Moreover, ProteinUnet achieved 100% Q3 accuracy for one protein sequence (2O6N Chain A), while SPIDER3-Single was never 100% correct.

FIGURE 3

The accuracy of secondary structure prediction (Q3) for individual sequences against the sequence length for ProteinUnet (green circles) and SPIDER3-Single (red triangles) on the TS1197 dataset [Color figure can be viewed at wileyonlinelibrary.com]

The biggest difference at the sequence level in favor of ProteinUnet was for protein 1T1V Chain A with 93 residues, for which ProteinUnet achieved 79.57% and SPIDER3-Single only 54.84%. SPIDER3-Single had the biggest advantage over ProteinUnet for protein 1KAF Chain A with 108 residues, for which ProteinUnet achieved 65.74% and SPIDER3-Single 83.33%. Overall, ProteinUnet achieved better results for 578 sequences and SPIDER3-Single for 520. For 99 sequences, both models achieved the same accuracy, which does not necessarily mean they made the same predictions, since the mistakes might have been at different positions.

Influence of Q8 loss weighting

The results for ProteinUnet with weighted Q8 loss are presented in Figure 4 in comparison to the nonweighted versions of ProteinUnet and SPIDER3-Single. As expected, weighting helped to achieve much better accuracies for all rare states (G, S, I, B). The highest increase (by 9 pp.) was observed for state G (3₁₀-helix). For states B (beta-bridge) and I (pi-helix), weighting raised the accuracy above the 0% level. As a side effect of weighting, accuracies for states C (coil), H (alpha-helix), and T (beta-turn) decreased by up to 2 pp. and were lower than for nonweighted ProteinUnet and SPIDER3-Single. This caused the overall Q8 accuracy of weighted ProteinUnet to be slightly worse than before weighting, at both the sequence level (61.59%) and the residue level (59.83%). Interestingly, after weighting, the accuracy for the frequent E state (beta-strand) was better than for nonweighted ProteinUnet and SPIDER3-Single. All the effects mentioned in this section were statistically significant at p < .005 according to two-sided Wilcoxon signed-rank tests.

FIGURE 4

The comparison of mean accuracy at the sequence level for each Q8 state on TS1197 dataset between weighted and nonweighted ProteinUnet and nonweighted networks [Color figure can be viewed at wileyonlinelibrary.com]

Regression

Figure 5 presents the distribution of the regression outputs for the TS1197 dataset. The majority of ϕ angles are close to −63°, and the most common prediction of both models is −65°. Both models rarely predict values below −125°, although more than 40,000 residues have true values below −125°. The ψ angles are grouped around two local maxima: −42° and 135°. Surprisingly, predictions for ψ below −45° or above 150° are rare, while among the true values they account for more than 73,000 cases.

FIGURE 5

The distribution of regression outputs for the TS1197 dataset. True values are presented with a solid gray line, predicted values for ProteinUnet with a solid green line, and for SPIDER3-Single with a dashed red line [Color figure can be viewed at wileyonlinelibrary.com]

The majority of θ angles are close to the 91° maximum or the local maximum around 117°, and the values span from 64° to 177°. Both models' predictions fall between 84° and 145°, so the long tails are not predicted at all. Moreover, values between 95° and 120° were predicted much more often than they occurred. The τ angle predictions are grouped around two local maxima, 50° and −165°, but the true values are more widely distributed. In particular, predictions around −165° are more than twice as frequent as the true values. For the τ angle, SPIDER3-Single tends to predict values around the maxima more often than ProteinUnet. Both models fail to predict the cases when ASA is equal to 0, with SPIDER3-Single predictions slightly shifted to lower values. The sigmoid output function might be the reason for the poor performance of the predictions around 0. The ASA predictions for both models do not exceed 190, while the maximum true value was 297. The CN values span from 0 to 84, while the model predictions range between 10 and 65. The SPIDER3-Single prediction distribution is shifted to higher values. The distribution of HSEα-up predictions does not resemble the true value distribution. The maximum predicted value was 31, while the maximum true value was 45. The true values of HSEα-down range between 2 and 51, while the predictions fall between 6 and 35. For both HSE outputs, predictions from SPIDER3-Single are shifted more toward higher values compared to ProteinUnet.

Local contacts analysis

Figures 6 and 7 show the dependence of the accuracy of secondary structure prediction on the number of local and nonlocal contacts of a residue, respectively. Exactly as in ref. [10], nonlocal contacts are defined as contacts between two residues that are at least 20 residues apart in sequence position but less than 8 Å apart in terms of the distance between their Cα atoms. Each point presented on the plots represents at least 1,000 residues. For both ProteinUnet and SPIDER3-Single, Q3 accuracy decreases sharply when the number of local or nonlocal contacts exceeds 2. ProteinUnet shows noticeably better results for residues with a small number of local contacts (<3), but noticeably worse results for those with more than nine nonlocal contacts. This confirms that ProteinUnet is better at capturing close local dependencies (up to 12 pp. more for two local contacts), but worse at analyzing long-range interactions (up to 2 pp. less for 11 nonlocal contacts). The number of nonlocal contacts is correlated with the length of the sequence, which may partly explain the trend visible in Figure 3.
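The contact definition above can be sketched as a simple pairwise count over Cα coordinates. Nonlocal contacts follow the stated rule (sequence separation of at least 20 and Cα distance below 8 Å). Treating every closer-in-sequence pair under the same distance cutoff as a local contact is an assumption, because the text defines only the nonlocal case. The function name and the toy coordinates are illustrative; `math.dist` requires Python 3.8 or later.

```python
import math


def count_contacts(ca_coordinates, min_separation=20, cutoff=8.0):
    """Return per-residue (local, nonlocal) contact counts from C-alpha coordinates."""
    n = len(ca_coordinates)
    local_counts = [0] * n
    nonlocal_counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coordinates[i], ca_coordinates[j]) < cutoff:
                # Assumption: pairs closer than min_separation in sequence are local.
                counts = nonlocal_counts if j - i >= min_separation else local_counts
                counts[i] += 1
                counts[j] += 1
    return local_counts, nonlocal_counts


# Toy chain: residues 10 angstroms apart on a line, so there are no contacts
# by default.
coordinates = [(10.0 * i, 0.0, 0.0) for i in range(25)]
coordinates[2] = (10.0, 5.0, 0.0)   # moved next to residue 1 (local contact)
coordinates[24] = (1.0, 0.0, 0.0)   # folded back next to residue 0 (nonlocal)
local_counts, nonlocal_counts = count_contacts(coordinates)
print(local_counts[1], local_counts[2], nonlocal_counts[0], nonlocal_counts[24])
```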

FIGURE 6

Accuracy of predicted Q3 as a function of the number of local contacts on TS1197 dataset [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 7

Accuracy of predicted Q3 as a function of the number of nonlocal contacts on TS1197 dataset [Color figure can be viewed at wileyonlinelibrary.com]


CONCLUSIONS

ProteinUnet is the first model that successfully leverages the U-Net deep learning architecture for sequence-based prediction of protein 1D structural properties. The model does not use evolutionary profiles generated from MSAs, such as PSSM or HHblits, which are computationally expensive to calculate. It achieves results comparable to the state-of-the-art sequence-based model SPIDER3-Single, which is based on the LSTM-BRNN architecture, while having two times fewer parameters and running several times faster (11 times faster training and 17 times faster inference). This makes it especially useful for large-scale predictions and for applications on low-cost and embedded devices. Moreover, ProteinUnet shows better results for short sequences and residues with a low number of local contacts, so it should be preferred to SPIDER3-Single when these factors matter. Additionally, our experiments showed that the proposed weighting procedure can be used effectively in ProteinUnet to substantially increase the accuracy on the rare states. The results on the CASP13 dataset confirm that ProteinUnet performs as well as SPIDER3-Single for completely untrained folds. The disadvantages of the proposed architecture are mainly connected with the limited receptive field of convolutional networks. They include decreased accuracy for long chains and residues with many nonlocal contacts. However, they may be addressed in the future by increasing the context or receptive field of the U-Net, or by adding iterative training as described in ref. [10]. The next step is to improve the weighting procedure to avoid the decrease in accuracy on the more frequent states.

28 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Getting to Know Your Neighbor: Protein Structure Prediction Comes of Age with Contextual Machine Learning.

Authors: Jack Hanson; Kuldip K Paliwal; Thomas Litfin; Yuedong Yang; Yaoqi Zhou
Journal: J Comput Biol Date: 2019-08-30 Impact factor: 1.479

Review 3. Protein secondary structure prediction using neural networks and deep learning: A review.

Authors: Wafaa Wardah; M G M Khan; Alok Sharma; Mahmood A Rashid
Journal: Comput Biol Chem Date: 2019-08-12 Impact factor: 2.877

4. Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning.

Authors: Rhys Heffernan; Kuldip Paliwal; James Lyons; Jaswinder Singh; Yuedong Yang; Yaoqi Zhou
Journal: J Comput Chem Date: 2018-10-14 Impact factor: 3.376

5. Protein secondary structure prediction: A survey of the state of the art.

Authors: Qian Jiang; Xin Jin; Shin-Jye Lee; Shaowen Yao
Journal: J Mol Graph Model Date: 2017-07-19 Impact factor: 2.518

6. Fully-automated deep learning-powered system for DCE-MRI analysis of brain tumors.

Authors: Jakub Nalepa; Pablo Ribalta Lorenzo; Michal Marcinkiewicz; Barbara Bobek-Billewicz; Pawel Wawrzyniak; Maksym Walczak; Michal Kawulok; Wojciech Dudzik; Krzysztof Kotowski; Izabela Burda; Bartosz Machura; Grzegorz Mrukwa; Pawel Ulrych; Michael P Hayball
Journal: Artif Intell Med Date: 2019-11-27 Impact factor: 5.326

7. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

8. Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility.

Authors: Rhys Heffernan; Yuedong Yang; Kuldip Paliwal; Yaoqi Zhou
Journal: Bioinformatics Date: 2017-09-15 Impact factor: 6.937

9. Protein structure determination using metagenome sequence data.

Authors: Sergey Ovchinnikov; Hahnbeom Park; Neha Varghese; Po-Ssu Huang; Georgios A Pavlopoulos; David E Kim; Hetunandan Kamisetty; Nikos C Kyrpides; David Baker
Journal: Science Date: 2017-01-20 Impact factor: 47.728

10. ProteinUnet-An efficient alternative to SPIDER3-single for sequence-based prediction of protein secondary structures.

Authors: Krzysztof Kotowski; Tomasz Smolarczyk; Irena Roterman-Konieczna; Katarzyna Stapor
Journal: J Comput Chem Date: 2020-10-15 Impact factor: 3.376

4 in total

1. Lightweight ProteinUnet2 network for protein secondary structure prediction: a step towards proper evaluation.

Authors: Katarzyna Stapor; Krzysztof Kotowski; Tomasz Smolarczyk; Irena Roterman
Journal: BMC Bioinformatics Date: 2022-03-22 Impact factor: 3.169

2. Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment.

Authors: Jaspreet Singh; Kuldip Paliwal; Thomas Litfin; Jaswinder Singh; Yaoqi Zhou
Journal: Sci Rep Date: 2022-05-09 Impact factor: 4.996

Review 3. Recent Applications of Deep Learning Methods on Evolution- and Contact-Based Protein Structure Prediction.

Authors: Donghyuk Suh; Jai Woo Lee; Sun Choi; Yoonji Lee
Journal: Int J Mol Sci Date: 2021-06-02 Impact factor: 5.923

4. ProteinUnet-An efficient alternative to SPIDER3-single for sequence-based prediction of protein secondary structures.

Authors: Krzysztof Kotowski; Tomasz Smolarczyk; Irena Roterman-Konieczna; Katarzyna Stapor
Journal: J Comput Chem Date: 2020-10-15 Impact factor: 3.376
