
Deep learning program to predict protein functions based on sequence information.

Chang Woo Ko^1,2, June Huh^3, Jong-Wan Park^1,2.

Abstract

Deep learning technologies have been adopted to predict the functions of newly identified proteins in silico. However, most current models are not suitable for poorly characterized proteins because they require diverse information on the target proteins. We designed a binary classification deep learning program that requires only sequence information and named it 'FUTUSA' (function teller using sequence alone). During sequence feature extraction by a convolutional neural network, it applies sequence segmentation to learn regional sequence patterns and their relationships. This segmentation improved the predictive performance by 49% relative to the full-length process. Compared with a baseline method, our approach achieved higher performance in predicting oxidoreductase activity. FUTUSA also showed strong performance in predicting acetyltransferase and demethylase activities. Next, we tested whether FUTUSA can predict the functional consequences of point mutations. After being trained for monooxygenase activity, FUTUSA successfully predicted the impact of point mutations on phenylalanine hydroxylase, the enzyme responsible for the inherited metabolic disease phenylketonuria (PKU). This deep learning program can be used as a first-step tool for characterizing newly identified or poorly studied proteins.

• We propose a new deep learning program that predicts protein functions in silico and requires nothing more than the protein sequence information.
• The application of sequence segmentation improves the efficiency of prediction.
• This method makes it possible to predict the clinical impact of mutations or polymorphisms.

Entities: Chemical

Keywords: Deep learning; Point mutation; Protein functions; Sequence segmentation

Year: 2022 PMID: 35111575 PMCID: PMC8790617 DOI: 10.1016/j.mex.2022.101622

Source DB: PubMed Journal: MethodsX ISSN: 2215-0161

Specifications table

Overview

Advances in sequencing technologies help us discover a huge number of protein variants that are generated by alternative mRNA splicing, incomplete translation, single amino acid polymorphism, or gene fusion. These variations can expand protein functions, including protein-protein interactions, ligand binding, and subcellular localization, or give rise to completely novel functions. From an evolutionary perspective, such sequence variations may have developed to overcome the limited number of genes. We designed a deep learning program for protein function prediction that uses only amino acid sequences and named it ‘FUTUSA’ (function teller using sequence alone). Compared with a baseline method, FUTUSA achieved better performance in protein function prediction. In particular, FUTUSA improved classification performance for rare GO terms. We also successfully predicted the contribution of each amino acid to protein function. Therefore, FUTUSA could detect functional motifs in newly identified proteins and may also help predict the impact of some clinically identified mutations on disease progression.

Methods

Data source and preprocessing

To minimize human cognitive biases in the machine training steps, we used only GO annotation data and protein sequence data. We downloaded the UniProtKB/Swiss-Prot database (version 2020_02) (https://www.uniprot.org/downloads) [1]. The datasets included 17,965 human proteins and 6,195 yeast proteins. The hierarchical information of GO terms and the GO annotation data were obtained from the Gene Ontology Consortium (http://geneontology.org) [2,3]. We used a GO OBO file (released on Sep 19th, 2018) containing 47,343 GO terms. The performance comparison was done with an earlier GO file (released on Jan 1st, 2016) containing 43,937 terms. The downloaded human and yeast GO annotation files (released on Feb 2nd, 2020) contained 495,361 annotations for human and 120,936 annotations for yeast. In this research, we used all evidence codes, including those from experimental, phylogenetically inferred, and computational analyses. To compare our model with other prediction models, we used only EXP, IDA, IPI, IMP, IGI, IEP, TAS, and IC, which are considered experimental evidence codes according to CAFA3 [4]. Using these datasets, we narrowed down the target functions for training and constructed target-specific datasets. For instance, if we picked a GO term such as “oxidoreductase activity”, we labeled all proteins annotated with that GO term or with any of its ‘is_a’ descendants, such as “sulfur reductase activity”. We used protein segments composed of 16, 32, 64, or 128 amino acids for machine training. This approach removed the need to zero-pad every protein to a fixed full length and reduced the file sizes of the training data. Instead, we padded (segment size − 1) zeros at the N-terminus and (segment size) zeros at the C-terminus of each sequence to give the same weight to all amino acids except the N-terminal methionine. We also did not omit ‘ambiguous’ amino acids and considered all kinds of amino acids in the sequence data.
To reduce training time, however, we limited the maximum protein length according to each training purpose.
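The segmentation step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `segment_sequence`, the pad symbol `-`, and the stride of 1 are assumptions, since the text specifies only the segment sizes and the amount of padding at each terminus.

```python
def segment_sequence(seq, size, pad="-"):
    """Split a protein sequence into overlapping fixed-size segments.

    Assumptions (not stated in the text): windows slide with stride 1,
    and `pad` stands in for the zero padding.
    """
    if size < 1:
        raise ValueError("segment size must be positive")
    # (size - 1) pads at the N-terminus, (size) pads at the C-terminus.
    padded = pad * (size - 1) + seq + pad * size
    return [padded[i:i + size] for i in range(len(padded) - size + 1)]


segments = segment_sequence("MKTAYIAK", 4)
# With stride 1, every residue appears in exactly `size` segments.
windows_with_y = sum(1 for s in segments if "Y" in s)
```

Under these assumptions a sequence of length L yields L + size segments, and the last one contains padding only, so an implementation may choose to drop it.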

Deep learning architecture

We built the deep learning neural network models using the TensorFlow 2.0 framework and implemented them with the Keras deep learning library [5]. We used a cloud-based environment, Google Colaboratory (https://colab.research.google.com), and trained the neural network models on GPUs offered by Google Cloud [6]. The models are organized into four sections: embedding layers, feature extraction, dense layers, and scoring steps. The embedding layers convert the preprocessed sequence data into a numerical vector space that can be fed to the neural network. One-hot encoding, one of the most common encoding methods, represents each amino acid with a binary vector. This approach is straightforward and preserves the original amino acid information, but it can make the model excessively complex and cannot account for the physicochemical properties of amino acids. Recently, several research groups have tried machine learning-based encoding methods instead of manually defined ones to solve these problems [7]. In this study, we used a 1 × 1 CNN to generate the amino acid embedding vectors. Next, we extracted and learned the spatial features using a CNN. The dense layers weight the previously generated feature map without regard to its topology, which improves the recognition of distant features and of their combinations for classification. Lastly, the final binary classification output of the dense layers is taken as the predictive score of the input segment, and the total score for the input protein sequence is computed from the segment scores. We used a one-dimensional convolutional neural network (Conv1D) to extract the protein sequence-derived features. In Conv1D, a 1D array-like kernel slides along one dimension and identifies patterns in the sequence information. Conv1D has been widely used in sequence-based protein function prediction techniques [8,9,10,11]. ‘FUTUSA’ was built to be more flexible in the feature extraction steps.
We added a 1 × 1 CNN after the one-hot encoding so that the model also learns how to represent each amino acid [12]. Using a 1D convolutional neural network with a max pooling layer, it extracted the features without reducing their size. The model used convolution kernels of various sizes: the powers of 2 (2, 4, 8, ...) smaller than the segmentation size, together with the prime number just below each term of that series. The sequence-derived feature vectors were batch-normalized and activated by the rectified linear unit (ReLU) function [13,14]. Through the concatenation layer and the flatten layer, the feature map was fed to the fully connected layers. All intermediate hidden layers were batch-normalized and subjected to dropout regularization (p = 0.2) [15]. We activated the intermediate hidden layers with the ReLU function and the classification layer with the sigmoid function. The characteristics of the models are shown in Table S1.
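The kernel-size rule above is terse, so the sketch below shows one possible reading of it: take every power of 2 that is smaller than the segmentation size, and add the largest prime below each of those powers. The function names `is_prime` and `kernel_sizes` are hypothetical, and this interpretation is an assumption; Table S1 and the published source code remain the authoritative description.

```python
def is_prime(n):
    """Trial-division primality check, sufficient for small kernel sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def kernel_sizes(segment_size):
    """Powers of 2 below `segment_size` plus the largest prime below each power.

    This is an assumed interpretation of the kernel-size rule in the text.
    """
    sizes = set()
    power = 2
    while power < segment_size:
        sizes.add(power)
        candidate = power - 1
        while candidate >= 2 and not is_prime(candidate):
            candidate -= 1
        if candidate >= 2:  # there is no prime below 2
            sizes.add(candidate)
        power *= 2
    return sorted(sizes)


sizes_64 = kernel_sizes(64)
```

For a segmentation size of 64 this reading gives kernels of width 2, 3, 4, 7, 8, 13, 16, 31, and 32, which mixes even and odd receptive fields at every scale.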

Datasets and model training

For evaluation, we randomly set aside 5% of the total protein data before training. We then validated the model during training using five-fold cross-validation: of the remaining 95% of the protein data, 20% was randomly allocated as a validation dataset and 80% as a training dataset. For the case study, we picked the specific target protein and set it as the evaluation data. The models were trained with the Adam optimizer at a learning rate of 0.001 [16]. To address the imbalanced-dataset issue in predicting rare protein functions, we used weighted binary cross entropy as the loss function [17]. The weighted binary cross entropy loss can be written as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[(1-w)\,y_i\log\hat{y}_i + w\,(1-y_i)\log\left(1-\hat{y}_i\right)\right]$$

where $w$ is the ratio of the number of labeled proteins to the total number of trained proteins, $y_i$ is the real label, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples. This loss function is designed to give more weight to relatively rarely labeled functions. The mini-batch size was set to 128. To avoid overfitting, we trained the models for 10 to 100 epochs, always for less than 12 h, and adopted an early stopping strategy. The characteristics of the datasets used are shown in Table S2.
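A minimal sketch of such a weighted loss is given below. It assumes the common form in which positive examples are weighted by (1 − w) and negative examples by w, where w is the fraction of labeled (positive) proteins; the function name `weighted_bce` and the clipping constant `eps` are illustrative choices and do not come from the original work.

```python
import math


def weighted_bce(y_true, y_pred, w, eps=1e-7):
    """Weighted binary cross entropy averaged over all samples.

    Assumed form: positives weighted by (1 - w), negatives by w,
    where w is the fraction of positive (labeled) proteins.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += (1.0 - w) * y * math.log(p) + w * (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# With 10% positives, missing a positive costs more than a false alarm.
missed_positive = weighted_bce([1], [0.1], w=0.1)
false_alarm = weighted_bce([0], [0.9], w=0.1)
```

With w = 0.1 the missed positive is penalized nine times as heavily as the equally confident false alarm, which is the intended effect for rarely labeled functions.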

Comparison methods

To compare the predictive performance of our models with that of others, we adopted the CAFA baseline method, BLAST. The BLAST model predicts protein function based on sequence identity. We made minor changes to this model so that it predicts a specific protein function, because the CAFA challenge originally aims to explore the general functions of proteins. We built a database from the training dataset and searched it for sequences similar to the queried target proteins using the BLASTP software [18]. A query was predicted as positive when its BLAST results, filtered at an E-value cutoff of 1E-5, contained target training proteins.
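The decision rule of this baseline can be sketched as a filter over BLASTP hits. The sketch assumes the hits have already been parsed into (query, subject, E-value) tuples and that a query is called positive when at least one hit passing the cutoff is a positively labeled training protein; the function name `blast_predict` and the identifiers in the toy data are hypothetical.

```python
def blast_predict(hits, positive_ids, evalue_cutoff=1e-5):
    """Binary prediction per query from parsed BLASTP hits.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    positive_ids: training proteins annotated with the target function.
    """
    predictions = {}
    for query, subject, evalue in hits:
        predictions.setdefault(query, False)
        if evalue <= evalue_cutoff and subject in positive_ids:
            predictions[query] = True
    return predictions


toy_hits = [
    ("q1", "train_pos_1", 1e-30),  # strong hit to a positive protein
    ("q1", "train_neg_1", 1e-50),
    ("q2", "train_pos_1", 0.01),   # hit to a positive, but above the cutoff
    ("q3", "train_neg_2", 1e-10),  # strong hit, but to a negative protein
]
predicted = blast_predict(toy_hits, positive_ids={"train_pos_1"})
```

Queries with no hits at all do not appear in the returned dictionary and would be treated as negative by a caller.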

Performance evaluation

We evaluated the predictive performance through average precision (AP), F1-score (F1), area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), and Matthews correlation coefficient (MCC) [19,20,21]. The MCC, a widely used summary of the confusion matrix, can be computed at threshold $\tau$ directly from the confusion matrix using the following formula:

$$\mathrm{MCC}(\tau) = \frac{TP(\tau)\,TN(\tau) - FP(\tau)\,FN(\tau)}{\sqrt{\left(TP(\tau)+FP(\tau)\right)\left(TP(\tau)+FN(\tau)\right)\left(TN(\tau)+FP(\tau)\right)\left(TN(\tau)+FN(\tau)\right)}}$$

where $TP(\tau)$ is the number of true positives, $TN(\tau)$ is the number of true negatives, $FP(\tau)$ is the number of false positives, and $FN(\tau)$ is the number of false negatives at threshold $\tau$. The F1-score is the harmonic mean of the precision and the recall. The precision and the recall at threshold $\tau$ can be written as

$$P(\tau) = \frac{TP(\tau)}{TP(\tau)+FP(\tau)}, \qquad R(\tau) = \frac{TP(\tau)}{TP(\tau)+FN(\tau)}$$

Hence, the F1-score can be calculated by the following formula:

$$F1(\tau) = \frac{2\,P(\tau)\,R(\tau)}{P(\tau)+R(\tau)}$$

In this study, we present the maximum evaluation value over all thresholds, computed with a step size of 0.01, unless otherwise noted. The F1-score and the MCC are visualized with the MCC-F1 curve to cover the full range of possible thresholds [22]. AP, AUROC, and AUPR are independent of the threshold. The receiver operating characteristic (ROC) curve is a plot of the true positive rate (recall) versus the false positive rate, and the precision-recall curve is a plot of precision versus recall. The AP is another way to summarize the precision-recall curve. The true positive rate and false positive rate at threshold $\tau$ can be written as

$$TPR(\tau) = \frac{TP(\tau)}{TP(\tau)+FN(\tau)}, \qquad FPR(\tau) = \frac{FP(\tau)}{FP(\tau)+TN(\tau)}$$

Hence, AUROC, AUPR, and AP can be calculated by the following formulas:

$$\mathrm{AUROC} = \int_0^1 TPR \; d(FPR), \qquad \mathrm{AUPR} = \int_0^1 P \; dR, \qquad \mathrm{AP} = \sum_n \left(R_n - R_{n-1}\right) P_n$$

where $P_n$ and $R_n$ are the precision and recall at the nth threshold.
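The threshold-based metrics can be illustrated with a short, dependency-free sketch. It uses the standard definitions of MCC, F1, and AP and scans thresholds from 0 to 1 in steps of 0.01 as described; the function names are hypothetical, undefined ratios are reported as 0, and ties between scores are not treated specially in the AP calculation.

```python
import math


def confusion(y_true, scores, threshold):
    """Return (TP, TN, FP, FN) when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn  # equivalent to 2PR / (P + R)
    return 2 * tp / denom if denom else 0.0


def max_over_thresholds(y_true, scores, metric, step=0.01):
    """Maximum of `metric` over thresholds 0, step, 2*step, ..., 1."""
    n_steps = int(round(1 / step))
    return max(metric(*confusion(y_true, scores, k * step))
               for k in range(n_steps + 1))


def average_precision(y_true, scores):
    """AP = sum over positives of precision at that rank, divided by #positives."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    tp, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += (tp / rank) / n_pos
    return ap


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
best_f1 = max_over_thresholds(labels, scores, f1)
best_mcc = max_over_thresholds(labels, scores, mcc)
ap = average_precision(labels, scores)
```

In practice a library implementation would normally be preferred; the sketch is only meant to make the definitions concrete.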

Validation and discussion

Sequence segments-based training

First, we tested whether the sequence segmentation method improves the predictive performance for protein functions. We compared the segmentation model with the full-length model, which uses a zero-padding layer to fit the length of the protein sequences. The fully connected layers were unaltered in the full-length model. We trained FUTUSA to predict oxidoreductase activity (GO:0016491) and its descendants, which comprise 2371 terms. We used two different scoring methods to determine the score of each protein. One averaged the predictive scores of all segments from the protein; the other averaged only the highest-scoring segments, in the top 10%, to reduce the error arising from protein length and non-functional residues. Because the performance of the two methods depends on the context, we selected the better scoring method each time model training was completed. In all five metrics (AP, F1, AUROC, AUPR, and MCC), sequence segmentation improved the classification performance regardless of the segmentation size and CNN architecture. Fig. 1 displays the MCC-F1, ROC, and P-R curves for the various segmentation sizes. All the sequence segmentations performed better than the full-length process. Of these, the 64 amino acid segmentation was the best setting. The overall results are summarized in Table 1. The results show that machine learning with sequence segments substantially improves the ability to recognize protein features.
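The two protein-level scoring methods can be sketched in a few lines. The function name `protein_score` and the choice to round the top-10% count up (keeping at least one segment) are assumptions; the text specifies only that either all segments or the top 10% of segments are averaged.

```python
import math


def protein_score(segment_scores, top_fraction=None):
    """Aggregate per-segment scores into one protein-level score.

    top_fraction=None averages all segments; top_fraction=0.1 averages
    only the highest-scoring 10% (rounded up, at least one segment).
    """
    if not segment_scores:
        raise ValueError("at least one segment score is required")
    if top_fraction is None:
        chosen = list(segment_scores)
    else:
        k = max(1, math.ceil(len(segment_scores) * top_fraction))
        chosen = sorted(segment_scores, reverse=True)[:k]
    return sum(chosen) / len(chosen)


# A long protein with a short high-scoring region.
toy_scores = [0.1] * 18 + [0.9, 0.7]
mean_all = protein_score(toy_scores)
mean_top = protein_score(toy_scores, top_fraction=0.1)
```

In this toy case the all-segment mean is diluted to about 0.17 by the low-scoring segments, whereas the top-10% mean stays at 0.8, which illustrates why the second method is less sensitive to protein length and non-functional residues.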

Fig. 1

The MCC-F1, ROC, and P-R curves of all tested sequence segmentation sizes. The MCC-F1 (a), ROC (b), and P-R (c) curves (FL: full-length model (blue); 16: segmentation size 16 model (green); 32: segmentation size 32 model (orange); 64: segmentation size 64 model (red); 128: segmentation size 128 model (black)).

Table 1

The overall evaluation results for all tested sequence segmentation sizes.

Model	AP	MCC	F1	AUPR	AUROC
FUTUSA_FL	0.3089	0.3863	0.3421	0.3058	0.7525
FUTUSA_16	0.4604	0.4764	0.4533	0.4576	0.8754
FUTUSA_32	0.4661	0.4413	0.4494	0.4631	0.8872
FUTUSA_64	0.5158	0.5186	0.5319	0.5129	0.8835
FUTUSA_128	0.4612	0.5406	0.5135	0.4592	0.7671

Predictive performance comparison

To verify our approach, we compared our model with the baseline method, BLAST. We selected oxidoreductase activity (GO:0016491, 721 annotated proteins) as a representative case of a big-sized dataset (>100 proteins), and acetyltransferase activity (GO:0016407, 81 annotated proteins) and demethylase activity (GO:0032451, 28 annotated proteins) as cases of small-sized datasets. We trained FUTUSA with the 64 amino acid segmentation. Table 2 shows the predictive performance of the models on the three datasets. For oxidoreductase, FUTUSA obtained AP = 0.4319, MCC = 0.4508, and F1 = 0.4528, which are higher than the BLAST values (0.1509, 0.3386, and 0.3014, respectively). On the small-sized datasets, FUTUSA reached AUPR values of 0.3166 for acetyltransferase and 0.3297 for demethylase, and its AP, MCC, and F1 values were also higher than those of the BLAST model. Consequently, we propose FUTUSA as a powerful protein function-predicting program applicable to both big-sized and small-sized datasets.

Table 2

The performance comparison of the competing models on the oxidoreductase activity (GO:0016491), acetyltransferase activity (GO:0016407) and demethylase activity (GO:0032451) datasets.

Activity	Program	AP	MCC	F1	AUPR	AUROC
Oxidoreductase	BLAST	0.1509	0.3386	0.3014	-	-
Oxidoreductase	FUTUSA	0.4319	0.4508	0.4528	0.4272	0.8136
Acetyltransferase	BLAST	0.0649	0.1818	0.2374	-	-
Acetyltransferase	FUTUSA	0.3212	0.4444	0.5331	0.3166	0.7587
Demethylase	BLAST	0.1521	0.3529	0.3826	-	-
Demethylase	FUTUSA	0.3486	0.5000	0.5145	0.3297	0.6906

AUPR and AUROC are not reported for BLAST: because the BLAST model is a binary predictor, the areas under the ROC and PR curves do not assess its performance.


Prediction of the functional impacts of mutations in phenylalanine hydroxylase

To explore a new application of FUTUSA, we investigated whether it could predict the functional consequence of a single amino acid variation. To evaluate the contribution of each amino acid to the functional property, we computed the average values of all segments in phenylalanine hydroxylase (PAH), the enzyme responsible for the inherited metabolic disease PKU. PAH, which belongs to the monooxygenase family, is a homo-tetrameric enzyme composed of an N-terminal regulatory domain, a central catalytic domain, and a C-terminal tetramerization domain [23]. Clinically, it is very important to predict the functional consequences of the PAH mutations found in patients’ specimens. Such information may be one of the guiding factors in deciding how to care for PKU patients. To predict PAH activity, we trained FUTUSA with a new dataset comprising 114 monooxygenases. The trained model successfully predicted the catalytic domain as a crucial region and assigned low scores to the regulatory and tetramerization domains (Fig. 2a). The N-terminal region is classified as an ACT domain, which is widely found in enzymes of amino acid metabolism. Notably, this region obtained a low score even though it shares several identical residues with other aromatic amino acid hydroxylases. This may indicate that the domain is not essential for monooxygenase activity. To visualize the prediction results, we highlighted the high-score region in the 3D structure of the human PAH protein using CAVER Analyst 2.0 [24,25,26]. Previous studies reported that phenylalanine forms hydrophobic interactions with R270, T278, P279, F331, G346, G349, and S350. The thiophene ring of the substrate was stacked against the imidazole ring of H285 [25,27]. The high-score region of the model was close to the active site of human PAH. Fig. 2b shows that it covered most of the substrate-binding and iron-binding residues. The low-score regions do not contact the substrate and are located far from the active site.
In particular, the region predicted by FUTUSA covered the active site lid, Y138 [28].
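One way to obtain a per-residue contribution score from segment scores is to average the scores of every segment that contains the residue. The sketch below assumes stride-1 segments with (segment size − 1) pads at the N-terminus, so that residue i (0-based) is covered by the segments starting at padded positions i to i + size − 1; the function name `residue_scores` is hypothetical and the exact averaging used by the authors may differ.

```python
def residue_scores(segment_scores, seq_len, size):
    """Average, for each residue, the scores of the segments covering it.

    segment_scores[j] is the score of the window starting at padded
    position j (stride 1, size - 1 pads before the first residue).
    """
    if len(segment_scores) < seq_len + size - 1:
        raise ValueError("not enough segment scores for this sequence")
    per_residue = []
    for i in range(seq_len):
        covering = segment_scores[i:i + size]
        per_residue.append(sum(covering) / len(covering))
    return per_residue


# Three residues, segment size 2: the middle residue sits in both
# high-scoring segments and therefore gets the highest contribution.
toy = residue_scores([0.0, 1.0, 1.0, 0.0, 0.0], seq_len=3, size=2)
```

The resulting list can be plotted as a heatmap along the sequence or mapped onto a structure.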

Fig. 2

The heatmap visualization of the predicted functional contribution score of individual amino acids. The predicted scores of FUTUSA are also overlaid onto the crystal structures of full-length human PAH (residues 21–446; PDB:6N1K) and the catalytic domain of human PAH (residues 117–428; PDB:1MMK). (a) The heatmap maps low predictive scores to green and high predictive scores to red. (b) The iron ion (cyan) is shown as a ball. The substrate analogue beta-2-thienylalanine (THA; yellow) and the amino acid residues of the binding pockets (gray and brown) are presented as sticks. The prediction was performed with phenylalanine-4-hydroxylase for monooxygenase activity.

Next, we computed the change in the predictive score after single amino acid deletion and substitution. More than 100 PAH mutations (missense and frameshift) that can lead to phenylketonuria (PKU) are reported in the ClinVar database [29]. These mutations present a broad range of phenotypic variation depending on the residual enzymatic activity. Hence, we tested the predictive performance of the full-length and segmentation models on the mutated sequences. FUTUSA_FL (full-length) failed to predict the outcomes of single amino acid changes (Fig. 3a). However, the segmentation models composed of 16 (Fig. 3b) and 64 (Fig. 3c) amino acids produced marked score changes after mutation. FUTUSA_16 showed score drops only in several crucial regions, whereas FUTUSA_64 marked crucial regions more widely. The regulatory and oligomerization domains showed trivial score changes in both segmentation models. FUTUSA_64 showed that many amino acid substitutions in the catalytic domain decreased the predictive score, but those in exons 4 and 11 did not do so substantially (Fig. 3c). Numerous missense variants and exon deletions related to loss of function and phenylketonuria have been detected in the catalytic domain [30,31].
These results show that the functional sites predicted by FUTUSA match the known functional domains well.
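The mutation scan behind such a heatmap can be sketched as a loop over every position and every replacement. The scoring function is passed in as a callable, because the trained model is not reproduced here; `mutation_scan`, `AMINO_ACIDS`, and the toy scorer are hypothetical names, and restricting substitutions to the 20 standard amino acids is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumption)


def mutation_scan(seq, score_fn, alphabet=AMINO_ACIDS):
    """Score change for every single deletion and substitution.

    Returns a dict keyed by (1-based position, "del" or new residue),
    holding mutant score minus wild-type score.
    """
    base = score_fn(seq)
    changes = {}
    for i, wild_type in enumerate(seq):
        deleted = seq[:i] + seq[i + 1:]
        changes[(i + 1, "del")] = score_fn(deleted) - base
        for aa in alphabet:
            if aa != wild_type:
                mutant = seq[:i] + aa + seq[i + 1:]
                changes[(i + 1, aa)] = score_fn(mutant) - base
    return changes


def toy_score(sequence):
    """Stand-in for a trained model: 1.0 if an 'HH' motif is present."""
    return 1.0 if "HH" in sequence else 0.0


scan = mutation_scan("AHHA", toy_score)
```

Negative values mark mutations that lower the predicted score (shown in blue in a heatmap such as Fig. 3) and positive values mark increases.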

Fig. 3

The heatmap of the score changes caused by single amino acid mutations. Protein function changes caused by point mutations were predicted using FUTUSA_FL (a), FUTUSA_16 (b), and FUTUSA_64 (c). The color indicates the score change after mutation, with blue as a decreased score and red as an increased score. Each column represents the position of the amino acid and each row represents the amino acid introduced by the mutation. The first row, del, indicates deletion of the amino acid. The prediction was performed with phenylalanine-4-hydroxylase for monooxygenase activity.

Discussion

Here we propose a deep learning-based protein function predictor named FUTUSA. Since it requires only sequence information, FUTUSA is suitable for the preliminary characterization of newly found proteins or uncharacterized variants. We also established a preprocessing method for sequence segmentation, which effectively extracts functional features from protein sequence data and improves the performance of function prediction. FUTUSA showed better performance than the baseline method, BLAST. Therefore, we propose FUTUSA as a new deep learning program for predicting the functions of uncharacterized proteins. Additionally, FUTUSA distinguished functionally essential amino acids from nonessential ones, suggesting that it could be used to predict the clinical impact of point mutations or single amino acid polymorphisms.

Generally, protein function prediction has a fundamental problem that results from training with an imbalanced dataset. For instance, the number of oxidoreductase proteins is 719, which is only 3.4% of the total proteins used to train FUTUSA. This means that the machine was trained more with irrelevant proteins (as negative controls) than with target proteins. Especially in small-sized datasets such as acetyltransferase and demethylase, this class imbalance in training raises the false-positive rate to a greater extent. As the equations for AUPR and AUROC show, AUROC is substantially affected by false positives, whereas AUPR is not [32]. For small-sized datasets, therefore, AUPR may evaluate the efficiency of function prediction more accurately than AUROC. For this reason, we emphasize the AUPR values (Table 2).

It should be noted that FUTUSA has some limitations in data processing. The sequence segmentation process increases the input data size, thereby increasing the training time. Thus, the segmentation should be optimized to achieve a good balance between predictive performance and training time.
Also, since FUTUSA is not a ready-to-use predictor, users should modify the training dataset, optimize the preprocessing parameters, and train it according to their purposes.

One of the reasons for the delay in the development of deep learning-based protein function predictors is that the performance of deep learning models depends on the quality and quantity of the training data. Therefore, some researchers have used diverse protein features, such as protein-protein interactions or protein motifs [8,9,33]. However, such features are available for extensively studied proteins but not for uncharacterized proteins. To solve this problem, many researchers have tried to develop new protein function predictors that require only amino acid sequences [10,34,35]. Despite many efforts, the sequence-alone approaches have not been satisfactory to users. In this context, FUTUSA is proposed as a new method to meet users’ needs.

In future work, we plan to modify the preprocessing step. In the present study, we assigned the same weight to all segmented sequences, even if they contributed insignificantly to protein function. Assigning different weights to each segment is expected to minimize artificial biases. For that, we should add a step that re-evaluates how much each segment contributes to the whole function. In addition, the model should be changed to a multi-class classification model. This version of the model was built as a single-class classification model to focus its predictive ability on classifying one protein function. However, it is evident that the hierarchical structure of GO terms and the corresponding annotation patterns also contain important information. Therefore, we intend to find a method of grouping GO terms that optimally predicts a single GO term. All the source code and datasets are available at https://github.com/snuhl-crain/FUTUSA.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Subject Area:	Biochemistry, Genetics and Molecular Biology
More specific subject area:	Protein Function Prediction
Method name:	FUTUSA (function teller using sequence alone)
Name and reference of original method:	There is no original method
Resource availability:	https://github.com/snuhl-crain/FUTUSA

24 in total

1. [Characteristics of PAH gene variants among 113 phenylketonuria patients from Henan Province].

Authors: Chen Chen; Zhenhua Zhao; Yilin Ren; Xiangdong Kong
Journal: Zhonghua Yi Xue Yi Chuan Xue Za Zhi Date: 2018-12-10

2. Structure of full-length human phenylalanine hydroxylase in complex with tetrahydrobiopterin.

Authors: Marte Innselset Flydal; Martín Alcorlo-Pagés; Fredrik Gullaksen Johannessen; Siseth Martínez-Caballero; Lars Skjærven; Rafael Fernandez-Leiro; Aurora Martinez; Juan A Hermoso
Journal: Proc Natl Acad Sci U S A Date: 2019-05-22 Impact factor: 11.205

3. The meaning and use of the area under a receiver operating characteristic (ROC) curve.

Authors: J A Hanley; B J McNeil
Journal: Radiology Date: 1982-04 Impact factor: 11.105

4. Biophysical characterization of full-length human phenylalanine hydroxylase provides a deeper understanding of its quaternary structure equilibrium.

Authors: Emilia C Arturo; Kushol Gupta; Michael R Hansen; Elias Borne; Eileen K Jaffe
Journal: J Biol Chem Date: 2019-05-10 Impact factor: 5.157

Review 5. Phenylalanine hydroxylase: function, structure, and regulation.

Authors: Marte I Flydal; Aurora Martinez
Journal: IUBMB Life Date: 2013-03-04 Impact factor: 3.885

6. Mutation analysis of PAH gene and characterization of a recurrent deletion mutation in Korean patients with phenylketonuria.

Authors: Yong Wha Lee; Dong Hwan Lee; Nam Doo Kim; Seung Tae Lee; Jee Young Ahn; Tae Youn Choi; You Kyoung Lee; Sun Hee Kim; Jong Won Kim; Chang Seok Ki
Journal: Exp Mol Med Date: 2008-10-31 Impact factor: 8.718

7. Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences.

Authors: Ashish Ranjan; Md Shah Fahad; David Fernandez-Baca; Akshay Deepak; Sudhakar Tripathi
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2019-04-16 Impact factor: 3.710

8. UDSMProt: universal deep sequence models for protein classification.

Authors: Nils Strodthoff; Patrick Wagner; Markus Wenzel; Wojciech Samek
Journal: Bioinformatics Date: 2020-04-15 Impact factor: 6.937

9. CAVER Analyst 2.0: analysis and visualization of channels and tunnels in protein structures and molecular dynamics trajectories.

Authors: Adam Jurcik; David Bednar; Jan Byska; Sergio M Marques; Katarina Furmanova; Lukas Daniel; Piia Kokkonen; Jan Brezovsky; Ondrej Strnad; Jan Stourac; Antonin Pavelka; Martin Manak; Jiri Damborsky; Barbora Kozlikova
Journal: Bioinformatics Date: 2018-10-15 Impact factor: 6.937

10. SDN2GO: An Integrated Deep Learning Model for Protein Function Prediction.

Authors: Yideng Cai; Jiacheng Wang; Lei Deng
Journal: Front Bioeng Biotechnol Date: 2020-04-29