Literature DB >> 33367627

Identification of Sub-Golgi protein localization by use of deep representation learning features.

Zhibin Lv1, Pingping Wang2, Quan Zou1,3,4, Qinghua Jiang2.   

Abstract

MOTIVATION: The Golgi apparatus plays a key role in protein biosynthesis within the eukaryotic cell, and its malfunction results in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identify sub-Golgi protein localization. Although some machine learning methods have been used to identify sub-Golgi protein localization by sequence representation fusion, accurate sub-Golgi protein identification remains challenging with existing methodology.
RESULTS: We developed a protein sub-Golgi localization identification protocol using deep representation learning features with 107 dimensions. With this protocol, we demonstrated that instead of fusing multiple types of protein sequence feature representation, as in previous state-of-the-art sub-Golgi protein localization classifiers, it is sufficient to exploit only one type of feature representation for more accurate identification of sub-Golgi proteins. Judged by independent testing on benchmark datasets, our protocol performs generally, reliably and robustly for sub-Golgi protein localization prediction. AVAILABILITY: A user-friendly webserver is freely accessible at http://isGP-DRLF.aibiochem.net and the prediction code is available at https://github.com/zhibinlv/isGP-DRLF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press.


Year:  2020        PMID: 33367627      PMCID: PMC8023683          DOI: 10.1093/bioinformatics/btaa1074

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

As an important organelle in eukaryotic cells, the Golgi apparatus (GA) processes, sorts and transports proteins synthesized by the endoplasmic reticulum (Tao et al., 2020), which are then delivered to specific compartments of the cell or secreted from the cell (Holthuis). GA malfunction can result in Parkinson’s disease (Fujita et al., 2006), Alzheimer’s disease (Gonatas et al., 1998) and other neurodegenerative disorders (Ligon et al., 2020). It is therefore essential to understand the functional details of the GA (Ravichandran et al., 2020), such as whether a protein localizes to the cis-Golgi (cis-Golgi protein) or to the trans-Golgi (trans-Golgi protein) (De Tito). Such an understanding would clarify GA function (Berry) and would provide clues to aid drug discovery and development (Stoeber et al., 2018). Over the past decade, several machine learning-based methods for identifying sub-Golgi protein localization (Yang et al., 2019b), i.e. cis- and trans-Golgi protein localization, have been developed using a few benchmark datasets (Ding et al., 2011, 2013; Yang et al., 2016b; Zhao et al., 2019). The previously reported sub-Golgi protein localization classifiers shared some common strategies for achieving high identification accuracy. The first was to use sequence feature fusion for better protein sequence representation (Ahmad; Rahman et al., 2018; Wang et al., 2020a). To the best of our knowledge, the most widely used features for high-performance sub-Golgi protein classifiers are fusions of position-specific scoring matrix (PSSM) features (Jiao et al., 2016b; Shen et al., 2019a,b; Yang), dipeptide composition frequencies (Ding, 2013; Lv; Rahman), pseudo amino acid physical and chemical properties (Jiao,c; Zhao; Zhou et al., 2019) and their derivative features. The second was to use an over-sampling method to overcome the class imbalance between sub-Golgi protein localizations in the training benchmark datasets (Ahmad, 2019; Lv; Rahman; Yang).
In addition to feature fusion and data over-sampling, the feature selection technologies employed included analysis of variance (ANOVA) (Ding; Tang et al., 2018), the Fisher method, minimum redundancy maximum relevance (mRMR), random forest recursive feature elimination (RF-RFE) and other methods that selected the best features in the vector space (Ahmad, 2019; Ding; Jiao; Rahman; Wang et al., 2020b; Yang). By combining the above techniques, the reported protein sub-Golgi localization classifier isGPT (Rahman) achieved the best independent testing scores (ACC = 95.3%, MCC = 0.85, Sn = 84.6% and Sp = 98.0%). The good performance of isGPT was attained by carefully selecting 2800 fusion features from an original feature space with 18 840 dimensions. The fusion features of isGPT were derived by combining six types of protein sequence representation: amino acid composition, dipeptide composition, tripeptide composition, n-gapped-dipeptide composition, position-specific features and pseudo amino acid composition (Rahman). Feature extraction plays an important role in protein sequence analysis, and an appropriate, well-selected feature representation such as that used by isGPT (Rahman) greatly improves the accuracy of protein sequence analysis (Lv). Given its automatic feature extraction and powerful feature representation capabilities, deep learning has been widely used in sequence analysis of proteins, DNA and RNA (Min; Wang et al., 2017; Xu, 2019; Xu et al., 2017, 2019). Deep learning is a form of machine learning that automatically learns feature representations by fitting the parameters of a neural network (Eraslan; Jiang). Based on this principle, pre-trained deep learning networks can be used for feature extraction from new data or migrated to other similar tasks such as image recognition and natural language processing, an approach known as transfer learning (Zhang et al., 2016; Zhou et al., 2020).
In 2019, Alley proposed a self-supervised, universal protein sequence deep representation learning tool, UniRep, which was trained on UniRef50 (a dataset with tens of millions of protein sequences) to better represent natural and de novo designed proteins. Other preprint works, such as TAPE (Rao), the BiLSTM embedding model (Bepler), PRoBERTa (Nambiar) and MULocDeep (Jiang), have used similar ideas to encode protein sequences as deep learned representations and have obtained good results in many protein sequence analysis applications. In this work, we utilized UniRep to extract deep representation learning features from sub-Golgi protein sequences. Then, using synthetic minority over-sampling (SMOTE) and light gradient boosting machine (LGBM) feature selection, we developed a high-performance support vector machine-based sub-Golgi protein localization classifier named isGP-DRLF. The leave-one-out cross-validation scores of isGP-DRLF on the D5 dataset, with only one type of feature representation vector, were ACC = 99.2%, MCC = 0.98, Sn = 100% and Sp = 98.4%, while its independent testing scores were ACC = 96.4%, MCC = 0.90, Sp = 84.6% and Sn = 100%. isGP-DRLF trained on the D3 dataset achieved independent testing scores of ACC = 98.4%, MCC = 0.95, Sn = 100% and Sp = 98.0%, a relative improvement of 3.25%, 11.7%, 18.2% and 0.0%, respectively, over the previously reported best independent-testing sub-Golgi protein classifier (isGPT), which uses six types of feature representation vectors. While isGPT used 2800-dimension fusion features for prediction, our isGP-DRLF used only 107-dimension features without any feature fusion. For sub-Golgi protein localization prediction, LGBM feature selection outperformed the ANOVA and MRMD feature selection technologies, and the support vector machine was the best identification algorithm in this study.
A user-friendly isGP-DRLF webserver is available at http://isGP-DRLF.aibiochem.net for computing on small sequence datasets. For large datasets, users can also download the trained model from https://github.com/zhibinlv/isGP-DRLF. Using UMAP (uniform manifold approximation and projection) (Mcinnes) feature visualization, we found that UniRep features represent proteins better than other features (single or fused feature types) for distinguishing proteins in the cis-Golgi from those in the trans-Golgi. Thus, although it abandons the feature fusion methods used by many state-of-the-art models, isGP-DRLF can employ merely one type of feature representation, a protein sequence deep representation learning feature, to provide a powerful prediction tool for sub-Golgi protein localization.

2 Materials and methods

2.1 Datasets

There are several benchmark datasets (Ding, 2013; Yang; Zhao) with different sequence homologies and sizes for sub-Golgi protein identification modeling (see the details of datasets D0, D1, D2, D3 and D4 in Supplementary Table S1). Considering the availability of the benchmark datasets and the performance of sub-Golgi protein classifiers based on them, we used benchmark dataset D3 for model training and D4 for independent testing. D3 has been widely used by several state-of-the-art protein sub-Golgi localization classifiers (Ahmad, 2019; Rahman; Yang) and is available at https://www.mdpi.com/1422-0067/17/2/218; D4 is available at http://lin-group.cn/server/SubGolgi/data. To guard against homology or sequence similarity bias and against overfitting due to insufficient data, we also created a new, updated training benchmark dataset from the latest Universal Protein KnowledgeBase (UniProtKB Version 2020_05) (Jiang). We first searched for protein sequences at https://www.uniprot.org/locations/ using the keywords listed in Supplementary Table S1 footnote a (see the dataset construction search example in Supplementary Figure S0 of the Supplementary Materials), which yielded a temporary dataset. D4 was merged into this temporary dataset, and redundant sequences were then removed using PSI-CD-HIT (Jiang) with a 25% identity cutoff. Finally, the independent testing sequences of D4 were excluded from the temporary dataset to avoid overfitting, yielding the new benchmark dataset D5, which consists of 82 cis-Golgi proteins and 1065 trans-Golgi proteins for training. D5 is included in the Supplementary Materials. For a real-world application, we downloaded human sub-Golgi proteome sequences from UniProtKB_2020_05 using the keywords listed in Supplementary Table S1 footnote b.
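The sequence-identity clustering step above is performed externally with PSI-CD-HIT; the final step, excluding every independent-test (D4) sequence from the clustered training pool so no test sequence leaks into training, can be sketched as below. The accessions and sequences are hypothetical toy values, not from the actual datasets.

```python
# Sketch of the final train/test separation step: keep only clustered
# sequences that do not appear in the held-out independent test set.
# All accessions/sequences here are hypothetical.

def build_training_set(clustered, d4_test):
    """Return clustered sequences absent from the held-out D4 set."""
    test_seqs = set(d4_test.values())
    return {acc: seq for acc, seq in clustered.items() if seq not in test_seqs}

clustered = {"P1": "MKT", "P2": "GGA", "P3": "LLV"}  # after PSI-CD-HIT
d4_test = {"Q1": "GGA"}  # test sequence also present in the temporary set

train = build_training_set(clustered, d4_test)  # "P2" is excluded
```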

2.2 Feature representation

Here, we used protein sequence unified representation (UniRep) (Alley) to convert protein sequences into feature vectors. UniRep, proposed in 2019, uses a totally different feature representation method from the previously and extensively used protein sequence feature extraction methods (Ahmad, 2019; Lv; Rahman; Yang; Zhou). Details of UniRep can be found in Alley. Briefly, UniRep uses a slightly modified version of the original multiplicative long short-term memory architecture (mLSTM) (Krause), shown in Figure 1, as a deep representation learner trained by self-supervision to predict the next amino acid in a sequence. UniRef50, downloaded from UniProt, was used to train UniRep on this next-amino-acid prediction task (Bateman). The protein feature vectors derived from UniRep have been used as input for prediction or clustering of secondary structure, stability, diverse functions and semantic similarity of proteins, resulting in more efficient protein engineering (Alley; Qi). In this work, each sub-Golgi protein sequence was first converted into an integer sequence by mapping the jth amino acid a_j to its integer index, where a_j is one of the canonical or non-canonical amino acid symbols (X, B, Z, J). The integer sequence (of the same length as the protein sequence) was then embedded into a 1900-dimensional feature vector via the UniRep method for subsequent supervised prediction; the 1900 features are the average of the output hidden states of the UniRep model. For UniRep calculation details, see Alley. For comparison, several state-of-the-art protein sequence deep representation embeddings from preprint works, including TAPE (Rao) and the BiLSTM embedding (Bepler), were also used. See Supplementary Text S1 for details.
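The two encoding steps above (residue-to-integer mapping, then averaging per-residue hidden states into one fixed-length vector) can be sketched as follows. The real UniRep model is a trained mLSTM; here a random embedding table stands in for its hidden states, and the vocabulary ordering is illustrative, not UniRep's actual one.

```python
import numpy as np

# Sketch of UniRep-style encoding with a random table standing in for
# the trained mLSTM; only the workflow (not the learned weights) matches.

AA_VOCAB = "ACDEFGHIKLMNPQRSTVWYXBZJ"  # canonical residues + X, B, Z, J
AA_TO_INT = {aa: i for i, aa in enumerate(AA_VOCAB)}

def to_int_sequence(seq):
    """Step 1: map each residue symbol to its integer index."""
    return [AA_TO_INT[aa] for aa in seq]

def embed(seq, dim=1900):
    """Step 2: average per-residue 'hidden states' into one vector."""
    rng = np.random.default_rng(0)
    table = rng.standard_normal((len(AA_VOCAB), dim))  # mLSTM stand-in
    return table[to_int_sequence(seq)].mean(axis=0)

vec = embed("MKTLLVAG")  # one 1900-dimensional vector per sequence
```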
Fig. 1.

Modeling overview. The Golgi protein sequence is first converted into 1900-D features by the deep representation learning model UniRep. The 1900-D features are either fed directly into ten classifiers, or filtered by LGBM feature selection into 250-dimension vectors, which are then fed into the ten classifiers with or without SMOTE. In the next step, the top two classifiers are selected for further optimization with LGBM, ANOVA and MRMD feature selection. Finally, the optimal model (SVM) is used in the isGP-DRLF webserver


2.3 Feature selection

Compared with the 304 training samples in D3, the 1900-dimension UniRep feature vector of each protein sequence contains too many redundant features, which would cause overfitting of the machine learning model. In this study, three feature selection techniques were used to filter valid features for sub-Golgi classification. The first was ANOVA, which sorts features by the ratio of their between-group to within-group variance (Blanca; Tang et al., 2016); ANOVA has been widely used in bioinformatics, medical research and other fields (Jung; Tavakkolkhah et al., 2018). The second was the LGBM algorithm (Ke), which selects the best feature space based on feature importance values calculated by the LGBM model; LGBM feature selection has recently been applied successfully to RNA pseudouridine site and DNA N4-methylcytosine site prediction (Lv; Lv). LGBM is available at https://lightgbm.readthedocs.io. The third was the max relevance max distance (MRMD) method (http://lab.malab.cn/soft/MRMD3.0/index.html) (Zou et al., 2016), an integrated tool that uses a PageRank strategy over multiple popular feature ranking algorithms to determine a properly reduced feature space.
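Two of the ranking schemes above can be sketched on toy data: ANOVA ranks features by the between/within-group variance ratio (the F-score), and importance-based ranking reads a fitted model's feature_importances_. Since this sketch uses only scikit-learn, a random forest stands in for LightGBM on the importance side; the data are synthetic, with one deliberately informative feature.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.ensemble import RandomForestClassifier

# Toy data: 20 noise features, with feature 3 made class-informative.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = rng.standard_normal((100, 20))
X[:, 3] += y * 2.0

# ANOVA ranking: descending F-score (between/within-group variance ratio).
f_scores, _ = f_classif(X, y)
anova_rank = np.argsort(f_scores)[::-1]

# Importance-based ranking (random forest as a LightGBM stand-in).
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
tree_rank = np.argsort(forest.feature_importances_)[::-1]

top_k = 5
selected = anova_rank[:top_k]  # reduced feature space for downstream models
```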

2.4 Imbalanced data processing

The training benchmark datasets D3 and D5 are class-imbalanced: the number of cis-Golgi proteins is much smaller than that of trans-Golgi proteins. Reported sub-Golgi classifiers have demonstrated that such class imbalance significantly impacts real application performance (Lv): the trained model is more likely to identify the majority class while ignoring the minority class. To overcome this, under-sampling is used when samples are plentiful and over-sampling when they are scarce (Fernandez). For sub-Golgi classification, over-sampling is often applied using SMOTE (Barua), which is available as a module of the imbalanced-learn toolkit (https://github.com/scikit-learn-contrib/imbalanced-learn) (Lemaitre).
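SMOTE's core idea can be illustrated in a few lines: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its minority-class nearest neighbours. This is a minimal numpy sketch of that idea on toy points, not a replacement for imbalanced-learn's SMOTE, which should be used in practice.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolation."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest minority-class neighbours of x (excluding x itself)
        d = np.linalg.norm(minority - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        neighbour = minority[rng.choice(nn)]
        gap = rng.random()  # random position along the segment
        out.append(x + gap * (neighbour - x))
    return np.array(out)

cis = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy minority class
synthetic = smote_like(cis, n_new=4)  # new points lie between neighbours
```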

2.5 Classifiers

To determine the most suitable machine learning algorithm, we tested ten popular algorithms: Logistic Regression (LR), K-nearest Neighbors (KNN), Decision Tree (DT), Gaussian Naive Bayes (NB), Bagging, Random Forest (RF) (Shi et al., 2019; Wang et al., 2019a, 2020b), Ada Boosting (AB), Light Gradient Boosting Machine (LGBM), Support Vector Machine (SVM) (Huo; Wang et al., 2019b) and Linear Discriminant Analysis (LDA). All are out-of-the-box tools in the scikit-learn toolkit (https://github.com/scikit-learn/) (Pedregosa). Default hyper-parameters were used for first-round classifier filtering, and the top two classifiers were selected for hyper-parameter optimization to determine the optimal classifier. Please see the details in Supplementary Text S2.
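The first-round screening described above amounts to cross-validating each out-of-the-box classifier and keeping the top performers. A condensed sketch with four of the ten algorithms on synthetic imbalanced data (standing in for the UniRep feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy imbalanced data standing in for UniRep feature vectors.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           weights=[0.3, 0.7], random_state=0)

# Default hyper-parameters, 10-fold cross-validation, as in round one.
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
mean_acc = {name: cross_val_score(clf, X, y, cv=10).mean()
            for name, clf in candidates.items()}

# Keep the two best classifiers for hyper-parameter optimization.
top2 = sorted(mean_acc, key=mean_acc.get, reverse=True)[:2]
```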

2.6 Evaluation metrics and methods

As for most binary classification machine learning methods, four standard metrics, accuracy (ACC), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC), were adopted to evaluate the performance of the trained models (Ding, 2019; Hong; Li; Yang et al., 2018, 2019a; Zeng et al., 2018, 2019). They were calculated as in Equations (2) to (5):

ACC = (TP + TN)/(TP + TN + FP + FN)    (2)
Sn = TP/(TP + FN)    (3)
Sp = TN/(TN + FP)    (4)
MCC = (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]    (5)

where TP, TN, FP and FN are the numbers of predicted true positives, true negatives, false positives and false negatives, respectively. In this study, proteins at the cis-Golgi location were denoted as positive samples and proteins at the trans-Golgi location as negative samples. The area under the receiver operating characteristic (ROC) curve (auROC) was also used for model evaluation (Dao,b; Feng; Zhang et al., 2008). The ROC curve is constructed by plotting the true positive rate against the false positive rate, with auROC ranging from 0 to 1; an auROC of 1 indicates perfect prediction, while an auROC of 0.5 indicates random prediction for both positive and negative samples. In addition to these evaluation metrics, 10-fold and leave-one-out (LOO) cross-validation (Deng; Zhang et al., 2019a,b) and independent testing protocols were utilized (Jiao). Please see the details in Supplementary Text S3.
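The four metrics of Equations (2) to (5) follow directly from the confusion-matrix counts; a minimal sketch (the counts below are illustrative, not results from the paper):

```python
import numpy as np

def metrics(tp, tn, fp, fn):
    """ACC, Sn, Sp and MCC from confusion-matrix counts (Eqs 2-5)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)          # sensitivity: recall on positives (cis)
    sp = tn / (tn + fp)          # specificity: recall on negatives (trans)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sn, sp, mcc

# Illustrative counts for an imbalanced cis/trans split.
acc, sn, sp, mcc = metrics(tp=45, tn=200, fp=17, fn=5)
```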

3 Results

3.1 Initial performances of different classifiers

First, we used ten widely used machine learning algorithms to classify cis-Golgi and trans-Golgi protein sequences represented by the 1900-dimension deep learning features, without data balancing or feature selection. Box plots of the 10-fold cross-validation accuracy and ROC curves of these ten classifiers are shown in Figure 2A and B. The support vector machine classifier had the best average accuracy, 77.3%, with an auROC value of 0.765 and an MCC value of 0.379 (see Supplementary Table S2). The low ACC, MCC and auROC values here were likely caused by data imbalance, i.e. the numbers of positive and negative samples were far from equal.
Fig. 2.

Box plots and ROC curves of ten-fold cross-validation accuracy for ten classifiers (LR: Logistic Regression, KNN: K-nearest Neighbors, DT: Decision Tree, NB: Gaussian Naive Bayes, Bagging: Bagging, RF: Random Forest, AB: Ada Boosting, LGBM: Light Gradient Boosting Machine, SVM: Support Vector Machine, LDA: Linear Discriminant Analysis) using different feature processing technologies. A and B used the 1900-dimension UniRep feature vectors; C and D used SMOTE to balance the 1900-dimension UniRep feature vectors; for E and F, 250 features were additionally selected by the LGBM feature selection method. Green triangles and orange lines in A, C and E are the average and median accuracy values of the 10-fold cross-validation. In every case, the SVM classifier had the highest average accuracy (77.32%, 90.31% and 90.76%, respectively) and the highest average auROC (0.765, 0.940 and 0.958, respectively)

Second, to further improve classifier accuracy, we used SMOTE to balance the positive and negative samples, with results shown in Figure 2C and D and Supplementary Table S2. After SMOTE balancing, all 10-fold cross-validation evaluation metrics except the ACC of the KNN model were greatly improved, especially the MCC values. The SVM classifier was again top-ranked for its high accuracy and auROC value. The training dataset D3 contains 304 sequences: 87 cis-Golgi and 217 trans-Golgi protein sequences. The 1900-dimension features thus far outnumber the training samples, which would cause feature redundancy and potential overfitting, so in a third step we used LGBM feature selection to sort the 1900 deep representation learning features by importance value and selected the top 250 features for model fitting. The results are shown in Figure 2E and F and Supplementary Table S2.
After feature dimension reduction, the performance of some machine learning algorithms declined; for example, the accuracies of Gaussian Naive Bayes (NB), LR and AB decreased by absolute values of 14%, 0.7% and 0.85%, respectively. The accuracies of the other seven classifiers rose by absolute values ranging from 0.4% to 7.5%, among which SVM ranked first at ACC = 90.76% and LGBM second at ACC = 90.57%. Since the performances of SVM and LGBM were quite close, both were chosen for subsequent optimization and comparison.

3.2 Effect of feature selection technologies on SVM and LGBM classifiers

To determine the optimal feature space for the SVM and LGBM classifiers, we used a two-step feature optimization strategy. In the first step, we used three feature selection technologies (ANOVA, MRMD and LGBM) to calculate feature importance values, namely the F-score for ANOVA, PageRank values for MRMD and Gini-based feature importance values for LGBM, yielding three descending-order rankings of the deep representation learning features. In the second step, for each ranking, we took the top 200 features and used a feature-by-feature incremental strategy to determine the optimal feature vector space for the SVM and LGBM classifiers. The 10-fold cross-validation accuracy results are shown in Figure 3A; for example, the black curve labeled ANOVA_SVM shows the accuracy of the SVM model as the number of ANOVA-selected features increases.
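The incremental strategy above can be sketched as: rank all features once, then grow the feature set one ranked feature at a time and keep the size with the best cross-validated accuracy. In this sketch, ANOVA ranking and synthetic data stand in for the paper's LGBM-ranked UniRep features, and only the top 20 features are scanned to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the D3 UniRep feature matrix.
X, y = make_classification(n_samples=200, n_features=30, n_informative=8,
                           random_state=0)

# Step 1: one descending-order ranking (ANOVA F-score here).
rank = np.argsort(f_classif(X, y)[0])[::-1]

# Step 2: feature-by-feature incremental evaluation under 10-fold CV.
accs = []
for k in range(1, 21):                      # top-1 ... top-20 features
    cols = rank[:k]
    accs.append(cross_val_score(SVC(), X[:, cols], y, cv=10).mean())

best_k = int(np.argmax(accs)) + 1           # optimal feature count
```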
Fig. 3.

(A) Based on benchmark dataset D3, the average 10-fold cross-validation accuracy varied with the feature numbers for LGBM and SVM classifiers based on ANOVA, MRMD and LGBM feature selection technology. The best SVM had an accuracy of 92.16% with 158 features. The best LGBM classifier had an accuracy of 93.08% with 64 features. Both were based on LGBM feature selection technology. (B) Ten-fold cross-validation and LOO metrics for comparison of the best SVM (based on benchmark dataset D3 and D5) and LGBM classifier (based on benchmark dataset D3). (C) Independent test metrics on benchmark testing dataset D4 for the best SVM and LGBM classifier obtained by LOO using benchmark dataset D3 and D5

Initially, as the number of features increased, each model's accuracy curve rose sharply and then approached a fluctuating plateau. At the plateau, both the SVM and LGBM classifiers whose feature spaces were determined by LGBM feature selection (the LGBM_SVM model, purple curve, and the LGBM_LGBM model, golden curve in Fig. 3A) were more accurate than those based on the ANOVA and MRMD methods. The maximum accuracy of the LGBM_LGBM model was 93.08% with 64 features, higher than the 92.16% of the LGBM_SVM model with 158 features, although the accuracy of LGBM_LGBM fell below that of LGBM_SVM for feature numbers between 140 and 200. Considering the stability, robustness and generalization of the model, LGBM_SVM was selected as the final prediction model based on the following observations. The fluctuating purple and golden curves, together with the violin-box plot insets summarizing accuracy values for feature numbers from 35 to 200, are shown in Figure 3A; the accuracy of the LGBM_SVM model fluctuated less, with a smaller deviation. Since LOO cross-validation gives a more robust and stable assessment, we also compared the two models using different cross-validation evaluation methods.
Although the 10-fold cross-validation metrics (ACC, MCC, Sn and auROC, but not Sp) of the LGBM_LGBM (LGBM_10Fold) model were better than those of the LGBM_SVM (SVM_10Fold) model, as shown in the bar graph in Figure 3B, the LOO cross-validation metrics (ACC, MCC, Sn, Sp and auROC) of LGBM_SVM (SVM_LOO) exceeded those of LGBM_LGBM (LGBM_LOO). Moreover, the independent testing scores of SVM (SVM_LOO) were greater than those of LGBM (LGBM_LOO), as shown in Figure 3C. Although Yang had considered that the training dataset D3 and the test dataset D4 might share similar sequences, and used the CD-HIT tool to reduce sequences to 40% identity when constructing D3, the number of samples in D3 is not large enough to eliminate overfitting of the trained model. As the learning curves in Supplementary Figure S2A and B show, the models SVM-D3_1900Features (with 1900 features) and SVM-D3_158Features (with 158 selected features) overfit to some extent. Although the overfitting of SVM-D3_158Features is lower than that of SVM-D3_1900Features after feature selection and SMOTE, it is still present and should not be ignored, as overfitting affects the reliability and robustness of the applied models. In this work, to overcome overfitting, we increased the number of training samples and decreased sequence homology; that is, we constructed a new dataset, D5, with homology identity below 25% using PSI-CD-HIT as described in Section 2.1. Based on the new benchmark dataset D5, we used the modeling workflow shown in Figure 1 to optimize the SVM model combined with SMOTE and LGBM feature selection. As displayed in Figure 3B, the 10-fold and LOO cross-validation scores of the SVM model with 107 features (SVM-D5, see Supplementary Fig. S1) trained on D5 were greatly improved over those of the SVM model with 158 features (SVM-D3) trained on D3.
For instance, the 10-fold and LOO accuracies of SVM-D5 were both 99.2%, relative increases of 7.59% and 7.12% over the corresponding values of SVM-D3 (92.2% and 92.6%). As Figure 3B shows, with the training data volume increasing from 304 sequences in D3 to 1147 sequences in D5, the performance of SVM-D5 improved markedly over that of SVM-D3. Furthermore, with the increase in training data volume, the overfitting of SVM-D5 (with 158 features) was greatly reduced compared with SVM-D3, as shown in the learning curves of Supplementary Figure S2, and the overfitting of SVM-D5 with 107 features was essentially overcome and can be ignored.

3.3 Comparison with the state-of-the-art deep representation feature types

For comparison, four types of deep representation features from two preprint works were used to select the best feature type for the sub-Golgi prediction task. The leave-one-out cross-validation and testing results of SVM models trained on the D3 and D5 datasets using BiLSTM-lm, BiLSTM-ssa, TAPE-pooled and TAPE-avg features, respectively, are listed in Table 1. For models trained on D3, the model using UniRep features obtained the best leave-one-out cross-validation and independent testing scores. For models trained on D5, the model using UniRep features reached a leave-one-out cross-validation accuracy of 99.2%, slightly below the models using BiLSTM-lm, BiLSTM-ssa and TAPE-avg features by relative values of 0.2%, 0.6% and 0.5%, respectively. However, the independent testing accuracy of the model using UniRep features was 96.4%, much higher than the models using BiLSTM-lm, BiLSTM-ssa and TAPE-avg features by relative values of 4.5%, 10.1% and 8.2%.
Table 1.

Evaluation metrics comparisons of support vector machine classifiers based on different state-of-the-art deep representation learning features

Feature type   Trained dataset (identity)  Dims | LOO: ACC (%)  MCC   Sn (%)  Sp (%)  auROC | Test: ACC (%)  MCC   Sn (%)  Sp (%)  auROC
UniRep         D3 (40%)                    158  |      92.6     0.85  94.9    90.3    0.964 |       98.4     0.95  100     98.0    0.995
UniRep         D5 (25%)                    107  |      99.2     0.98  100     98.4    0.999 |       96.4     0.90  100     84.6    0.994
BiLSTM-lm      D3 (40%)                    77   |      88.7     0.78  85.2    92.1    0.917 |       92.1     0.75  96.1    76.9    0.983
BiLSTM-lm      D5 (25%)                    152  |      99.4     0.99  99.9    98.9    0.999 |       92.2     0.75  100     61.5    0.989
BiLSTM-ssa     D3 (40%)                    93   |      91.7     0.83  90.7    92.6    0.946 |       90.6     0.71  94.1    76.9    0.975
BiLSTM-ssa     D5 (25%)                    48   |      99.8     0.99  100     99.7    0.999 |       87.5     0.58  100     38.5    0.956
TAPE-pooled    D3 (40%)                    77   |      90.3     0.81  89.9    90.7    0.941 |       90.6     0.70  96.0    69.2    0.966
TAPE-pooled    D5 (25%)                    53   |      98.7     0.97  100     97.5    0.999 |       90.6     0.69  98.0    61.5    0.927
TAPE-avg       D3 (40%)                    67   |      91.9     0.84  94.0    89.9    0.963 |       96.4     0.91  100     96.1    0.985
TAPE-avg       D5 (25%)                    73   |      99.7     0.99  100     99.3    0.999 |       89.1     0.64  100     46.1    0.989
Considering both the leave-one-out cross-validation and independent testing scores of the models based on the five feature types listed in Table 1, the SVM model with 107 deep representation learning features (after leave-one-out cross-validation) was chosen as the final optimal model for the webserver, isGP-DRLF (identify sub-Golgi protein via deep representation learning features).

3.4 Comparison with the state-of-the-art classifiers

To further evaluate the performance of our classifiers, we compared isGP-DRLF with state-of-the-art classifiers in Supplementary Table S3, which summarizes the LOO cross-validation metrics of published classifiers based on the benchmark training datasets D0, D1, D2 and D3 listed in Supplementary Table S1. Classifiers based on the D3 dataset were superior to those based on D0, D1 and D2. Apart from the isGP-DRLF of this study and Ding’s SVM model (Ding), all other reported sub-Golgi classifiers had to fuse multiple feature types to attain acceptable results. isGP-DRLF was far better than Ding’s SVM model (Ding) as judged by LOO cross-validation (Supplementary Table S3) and independent testing scores (Supplementary Table S4). The six models based on D3 training are shown in Supplementary Tables S3 and S4 and in Table 2 of the main text. The KNN sub-Golgi localization classifier developed by Ahmad, using split amino acid composition (SAAC), 3-gap dipeptide composition (3gDPC) and position-specific scoring matrix (PSSM) fusion features, realized the best previous LOO accuracy of 98.2% with an independent test accuracy of 94.0%. Our isGP-DRLF based on D5, with 107 UniRep features, achieved 99.2% LOO accuracy and 96.4% independent test accuracy. Evidently, given the independent testing results of the models based on D3 and D5 in this study, isGP-DRLF is better at predicting unknown sub-Golgi protein sequences.
Table 2.

Evaluation metrics comparisons of the state-of-the-art classifiers

Classifier | Training dataset | No. of feature types | Feature dimensions | LOO cross-validation: ACC (%) / MCC / Sn (%) / Sp (%) | Independent testing: ACC (%) / MCC / Sn (%) / Sp (%)
SVM (this study) | D3 | 1 | 158 | 92.6 / 0.85 / 94.9 / 90.3 | 98.4 / 0.95 / 100 / 98.0
SVM (this study) | D5 | 1 | 107 | 99.2 / 0.98 / 100 / 98.4 | 96.4 / 0.90 / 100 / 84.6
KNN (Ahmad et al., 2017) | D3 | 3 | 83 | 94.9 / 0.90 / 97.2 / 92.6 | 94.8 / 0.86 / 94.0 / 93.9
KNN (Ahmad et al., 2019) | D3 | 3 | 180 | 98.2 / 0.96 / 98.6 / 97.7 | 94.0 / 0.84 / 81.5 / 96.9
RF (Yang et al., 2016b) | D3 | 4 | 55 | 88.5 / 0.68 / 88.9 / 88.0 | 93.8 / 0.82 / 92.3 / 94.1
SVM (Rahman et al., 2018) | D3 | 6 | 2800 | 95.9 / 0.92 / 95.9 / 92.6 | 95.3 / 0.85 / 84.6 / 98.0
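The four metrics reported in Table 2 are standard binary-classification measures computed from confusion-matrix counts. The sketch below shows their definitions; the counts used are illustrative, not the paper's actual confusion matrices.

```python
# Definitions of the Table 2 evaluation metrics from binary confusion
# counts: accuracy (ACC), Matthews correlation coefficient (MCC),
# sensitivity (Sn) and specificity (Sp). Counts are illustrative only.
import math

def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc, sn, sp

acc, mcc, sn, sp = metrics(tp=50, tn=11, fp=2, fn=1)
print(f"ACC={acc:.3f} MCC={mcc:.3f} Sn={sn:.3f} Sp={sp:.3f}")
```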
In addition to the independent testing dataset D4 of 64 sequences, and to test the model's practical applicability, we also applied isGP-DRLF to the human sub-Golgi proteome dataset of 423 sequences. The sequence localization distribution is shown in Figure 4A. isGP-DRLF attained 91.8% accuracy on the reviewed human sub-Golgi proteome, and it predicted that about 28% of the unreviewed human sub-Golgi proteome would be located in the cis-Golgi and 72% in the trans-Golgi (Fig. 4B and D). Among the state-of-the-art predictors listed in Table 2, only Lin's subGolgi2 webserver (http://lin-group.cn/server/subGolgi2) (Ding ) is currently available. subGolgi2 is a predictor using 2-gap dipeptide composition (2gDPC) features; we also tested it on the human sub-Golgi proteome dataset, with results shown in Figure 4C and D. For the reviewed sequences, subGolgi2 attained an accuracy of 77.3%, about 16% lower in relative terms than that of isGP-DRLF.
Fig. 4.

Human sub-Golgi proteome sequence distribution and the results of isGP-DRLF and subGolgi2 tested on the human sub-Golgi proteome dataset

To explain the difference between the models and their feature representation capabilities, we used the UMAP (uniform manifold approximation and projection) method (McInnes ) to reduce the UniRep feature space and the 2-gap dipeptide composition feature space to two dimensions; the results are shown in Supplementary Figure S3. Supplementary Figure S3A and B show that UniRep features were better than 2gDPC features for separating proteins located in the cis-Golgi from those located in the trans-Golgi. Moreover, as displayed in Supplementary Figure S3A-H, for the protein sub-Golgi localization task, UniRep features are superior to several widely used feature types listed in Supplementary Table S3. The strong sequence representation capability of UniRep enables a single feature type to achieve good classification accuracy without feature fusion technology.
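The projection workflow described above (fit a 2-component reducer, then scatter-plot the embedding colored by class) can be sketched as follows. The paper uses UMAP, typically via the umap-learn package; here PCA from scikit-learn stands in as a simpler, dependency-light reducer with the same fit_transform interface, and the features and labels are synthetic placeholders.

```python
# Sketch of projecting a high-dimensional feature space to 2-D for visual
# comparison of classes, as done for Supplementary Figure S3. PCA is used
# here as a stand-in for UMAP (same fit_transform workflow); swapping in
# umap.UMAP(n_components=2) would reproduce the paper's choice.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, (30, 107)),   # synthetic cis-Golgi-like cluster
    rng.normal(3.0, 1.0, (30, 107)),   # synthetic trans-Golgi-like cluster
])
y = np.array([0] * 30 + [1] * 30)      # class labels for coloring

emb = PCA(n_components=2).fit_transform(X)   # 60 x 2 embedding
print(emb.shape)
```

A scatter plot of `emb` colored by `y` then shows how well a feature type separates the two localizations, which is the comparison made between UniRep and 2gDPC features.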

3.5 Webserver implementation

The isGP-DRLF webserver is now available at http://isGP-DRLF.aibiochem.net and its interface is shown in Supplementary Figure S4. The webserver accepts FASTA-format protein sequences and identifies whether a protein is located in the cis-Golgi or trans-Golgi. Users can paste a FASTA-format Golgi protein sequence into the input box on the left and click the submit button; after a short wait, the prediction results are shown in the table on the right. Before starting a new task, users must first clear the input box to reactivate the submit button and then paste the new sequence. Owing to limited computing resources, no more than five sequences should be submitted at a time. For larger datasets, users can download the Python script and the trained model from https://github.com/zhibinlv/isGP-DRLF.

4 Conclusion

A novel state-of-the-art sub-Golgi protein localization classifier, isGP-DRLF, was developed using protein sequence deep representation learning feature vectors. Combined with over-sampling for imbalanced data and LGBM feature selection, isGP-DRLF achieved the best independent testing accuracy (ACC = 96.4%) for sub-Golgi protein prediction, a relative increase of 1.15% over the corresponding best value (ACC = 95.3%) of the previously reported model isGPT (Rahman ) listed in Table 2. In terms of independent testing error rate, isGP-DRLF's absolute error rate of 3.6% represents a 23.4% relative reduction from the 4.7% error rate of the previous best-reported model, isGPT (Rahman ). isGP-DRLF employs just one type of sequence representation feature yet outperforms state-of-the-art sub-Golgi classifiers that fuse multiple feature types. This study shows that protein sequence deep representation learning yields highly discriminating feature vectors, which should prove useful in the future for more accurate prediction of protein multi-subcellular localization (Armenteros ) and for recognition of protein chemical modification sites and signal peptides without requiring multi-type feature fusion (Armenteros ; Yang et al., 2016a). The current model applies only to sub-Golgi prediction. In the future, we will apply deep representation learning features to the prediction of multiple subcellular and suborganellar localizations of eukaryotic proteins, in the manner of DeepLoc (Armenteros ) or MULocDeep (Jiang ).