
Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier.

Yushuang Liu1,2, Shuping Jin1,2, Hongli Gao1,2, Xue Wang1,2, Congjing Wang1,2, Weifeng Zhou1,2, Bin Yu1,2.   

Abstract

MOTIVATION: Multi-label protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes SARS-CoV-2 invasion) or expression product at a specific location in a cell, which can provide a reference for the clinical treatment of diseases such as COVID-19.
RESULTS: The paper proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information. These methods include pseudo amino acid composition (PseAAC), encoding based on grouped weight (EBGW), gene ontology (GO), multi-scale continuous and discontinuous (MCD), residue probing transformation (RPT) and evolutionary distance transformation (EDT). Next, we utilize the multi-label information latent semantic index (MLSI) method to avoid the interference of redundant information. Finally, multi-label learning with feature induced labeling information enrichment (MLFE) is adopted to predict the multi-label protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset serve as the test sets. The overall actual accuracy (OAA) on the first four datasets is 99.23%, 93.82%, 93.24% and 96.72%, respectively, by leave-one-out cross-validation (LOOCV). It is worth mentioning that the OAA of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has obvious advantages in predicting the SCL of multi-label proteins, which provides new ideas for further research on the SCL of multi-label proteins.
AVAILABILITY AND IMPLEMENTATION: The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) (2021). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.


Year:  2021        PMID: 34864897      PMCID: PMC8690230          DOI: 10.1093/bioinformatics/btab811

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The structure and function of proteins are various, but they can only play their roles in the correct subcellular localization (SCL) (Chu; Costa). Changes in protein structure can cause diseases such as kidney disease (Ivanova), myocarditis (Jang), diabetes (Brownlee, 1995), dermatomyositis (Brownlee, 1995) and muscle atrophy (Sneddon). With the continuous increase of data and the expansion of research directions (Wan et al., 2017), traditional machine learning methods cannot achieve good prediction results (Marilyn; Zhang). First, the traditional machine learning methods are time-consuming and labor-intensive. Second, a protein may exist not only in one subcellular location, but in two or more. Prediction methods designed for single-location proteins ignore the case of two or more subcellular locations (Du). Finally, the high-dimensional space formed by multi-information fusion increases the interference of redundant information on the prediction results (Yu). Therefore, this article mainly optimizes feature extraction, feature selection and the classifier to improve prediction accuracy. Since protein sequences cannot be used directly for calculation, they must be transformed into digital information for further study (Yu). Zhang et al. utilized dipeptide composition (DC), pseudo position-specific scoring matrix (PsePSSM), pseudo amino acid composition (PseAAC), gene ontology (GO) and encoding based on grouped weight (EBGW) to extract protein information from relevant datasets. Wu et al. adopted GO and evolutionary information to develop a new predictor, iLoc-Gpos, on the Gram-positive bacteria dataset. Wan et al. used the relationships between GO terms to predict the SCL of a plant dataset. Zhang et al. used position-specific scoring matrix-transition probability composition (PSSM-TPC), DC, GO, PseAAC, PsePSSM and a differential evolution algorithm to assign weight vectors to five single features. Feature fusion can combine multiple kinds of information from protein sequences (Yu).
However, the interference of redundant information on the prediction results gradually increases with the dimension (Fan; Shi). To eliminate useless features in the original space, researchers have proposed a variety of dimensionality reduction methods. Zhang et al. put forward a manifold regularized discriminant feature selection (MDFS) algorithm that improves performance by optimizing the feature selection framework and considering label correlation. Zhang and Zhou (2010) suggested a multi-label (ML) dimension reduction method based on dependency maximization (MDDM), which maximizes the dependency between the original features and the related category labels to make dimension reduction more efficient. Xu et al. proposed the ML feature extraction algorithm via feature variance and feature-label dependence (MVMD), which integrates two least squares formulations and uses maximum feature variance and feature-label correlation to select the best feature vector. Zhang et al. presented the global relevance and redundancy optimization (GRRO) method, composed of feature relevance, label relevance and feature redundancy, which greatly improves computing efficiency. Choosing a suitable classifier is crucial for predicting the SCL of proteins. Wan et al. (Wan and Mak, 2018; Wang) proposed an adaptive decision-making scheme for support vector machines (AD-SVM); on the virus dataset its overall actual accuracy (OAA) was 93.24% and its overall location accuracy (OLA) was 96.03%. Wang et al. used the ensemble multiple classifier chain (ECC) to predict the protein SCL of the Gram-negative bacteria dataset, with an OAA of 94.03% and an OLA of 94.46%. Shen et al. used a multi-kernel SVM classifier to predict ML protein SCL on two human datasets; the average precision reached 70.65% and 68.89%, respectively, the best among the compared methods. To improve prediction accuracy, we propose a new model called ML-locMLFE to predict the SCL of ML proteins.
Six feature extraction methods are used to transform protein sequences into digital information, and the six types of feature information are fused. We then use the ML information latent semantic index (MLSI) to identify the most effective information among the many features. Finally, ML learning with feature induced labeling information enrichment (MLFE) is utilized to predict the SCL of ML proteins. Compared with other methods, ML-locMLFE is superior in predicting the SCL of ML proteins.

2 Materials and methods

2.1 Datasets

Five datasets are used to verify the effectiveness of the model. The Gram-positive bacteria dataset (Dehzangi) is the training set, while the Gram-negative bacteria dataset (Dehzangi), the virus dataset (Shen and Chou, 2010), the SARS-CoV-2 dataset (Zhang) and the newPlant dataset (Wan) together serve as the test sets. The Gram-positive bacteria, Gram-negative bacteria and virus datasets come from the Swiss-Prot database, and the breakdown of each dataset is shown in Supplementary Tables S1–S3. We obtained data from the UniProt database covering the past three years to construct a new plant dataset (named the newPlant dataset); the detailed breakdown is given in Supplementary Table S4. As a newly emerged coronavirus, SARS-CoV-2 can cause great harm to human health, so accurate identification of the SCL of SARS-CoV-2 proteins is helpful for analyzing the pathogenic mechanism of the virus. The SARS-CoV-2 dataset is constructed from the UniProt database, and the detailed breakdown is shown in Supplementary Table S5. The sequence homology within each of the five datasets is <25%.

2.2 Feature encoding

The quality of features has a crucial impact on the predictive ability of the model, so a suitable feature extraction method is a critical step in predicting the SCL of ML proteins. Six methods, namely PseAAC, EBGW, GO, residue probing transformation (RPT), evolutionary distance transformation (EDT) and multi-scale continuous and discontinuous (MCD), are adopted here.

PseAAC: PseAAC is a commonly used feature extraction method for SCL prediction. According to Chou (2001), PseAAC mainly reflects protein sequence information (Bahar; Sahu; Zhang). The encoding can be expressed by

x_u = f_u / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} τ_j),            1 ≤ u ≤ 20,
x_u = w τ_{u−20} / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} τ_j),     20 < u ≤ 20 + λ,

where τ_j represents the jth-level sequence correlation factor, f_i represents the frequency of the ith amino acid in the protein, and w is the weighting factor, set to 0.05 in this article (Chou, 2001). Because λ is the characteristic parameter, a (20 + λ)-dimensional feature vector is finally formed.

EBGW: Physical and chemical properties are among the important properties of proteins. Zhang et al. proposed EBGW, which divides the 20 amino acids into four categories, as shown in Table 1.
Table 1.

The 20 amino acids are divided into four groups (K1–K4)

Group                               Amino acids
Neutral and hydrophobic (K1)        A, F, G, I, L, M, P, V, W
Neutral and polar (K2)              C, N, Q, S, T, Y
Alkaline (basic) (K3)               K, H, R
Acidic (K4)                         D, E
Three disjoint two-group combinations can be obtained from Table 1: K1∪K2 versus K3∪K4, K1∪K3 versus K2∪K4 and K1∪K4 versus K2∪K3. According to formulas (3), (4) and (5), a protein sequence of length L is converted into three binary sequences of the same length. Each binary sequence is divided into J subsequences of progressively increasing length, and each forms a J-dimensional feature vector, so the three binary sequences form a 3J-dimensional vector.

GO: Extracting the GO information of each protein sequence with the GO model usually takes two steps (Huang; Shen): retrieving GO terms and building the GO vector. BLASTP is used to search the Swiss-Prot database and retain the homologous proteins with a similarity ≥60% to the query protein, with the E-value threshold set to 0.001 (Zhang). In the GOA database, we search for the accession numbers (ACs), obtained from the Swiss-Prot database, of each retained protein, and the corresponding GO terms are then collected (Xiao). The GO feature vector of each protein is constructed over the distinct GO terms of the dataset, whose size determines the vector dimension.

RPT: RPT is a feature extraction method that reflects the evolutionary information of protein sequences (Jeong). In the PSSM, domains with similar conservation are grouped according to conservation scores (Wang; Zhang): each column of the PSSM corresponds to one of the 20 standard amino acids, and the rows are separated into 20 groups by amino acid. The PSSM values are summed per group and column to form a 20×20 matrix, the RPT matrix, which is then flattened into a 400-dimensional feature vector.

EDT: EDT is an effective method to calculate the probability of two amino acids co-occurring at an interval d (Jeong), where d is at most the shortest sequence length in the dataset.
The EDT feature vector collects, for each ordered pair of the 20 amino acids, the probability computed by formula (10) from the PSSM elements, with the interval d ranging from 1 up to its maximum value; this yields a 400-dimensional vector.

MCD: Owing to the influence of continuous and discontinuous fragments in protein sequences, You et al. proposed the MCD feature extraction method, which converts a protein sequence into digital information by a grouping-based encoding. For example, the randomly selected protein sequence 'AVDCALSK' is transformed into the digital model '11321476' via the MCD encoding. The sequence is then divided into 10 regions, and composition (C), transition (T) and distribution (D) descriptors are calculated for each region. Finally, a 630-dimensional feature vector is formed from the descriptors of all 10 regions.
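As a concrete illustration of the EBGW encoding described above, the following sketch maps a sequence to its 3J-dimensional vector. The group assignments follow Table 1; the three binary partitions and the prefix-style subsequence scheme are assumptions based on the standard EBGW formulation, not code from the paper.

```python
# Hypothetical EBGW sketch: three binary views of a sequence, each summarized
# over J subsequences of increasing length.
K1 = set("AFGILMPVW")   # neutral and hydrophobic
K2 = set("CNQSTY")      # neutral and polar
K3 = set("KHR")         # alkaline (basic)
K4 = set("DE")          # acidic

# The three disjoint two-group combinations obtainable from Table 1
PARTITIONS = [K1 | K2, K1 | K3, K1 | K4]

def ebgw(sequence: str, J: int) -> list[float]:
    """Map a protein sequence to a 3*J-dimensional EBGW feature vector."""
    features = []
    for positive in PARTITIONS:
        binary = [1 if aa in positive else 0 for aa in sequence]
        for j in range(1, J + 1):
            # subsequence length grows progressively with j
            length = max(1, round(len(binary) * j / J))
            features.append(sum(binary[:length]) / length)
    return features

vec = ebgw("MKKLLPTAAAGLLLLAAQPAMA", J=5)
print(len(vec))  # 15 = 3 * J
```

With the optimal parameter reported later for the Gram-positive dataset, J = 40 would give the 120-dimensional EBGW block used in the fusion.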

2.3 ML information latent semantic indexing

Assume the feature space contains n samples, each a d-dimensional feature vector, to be reduced to k dimensions. MLSI (Yu) defines the input matrix X = [x_1, …, x_n], where x_i is the d-dimensional feature vector of sample i, and the output matrix Y = [y_1, …, y_n], where y_i is the label vector of sample i. A kernel function expresses the inner product K(x_i, x_j) = φ(x_i)·φ(x_j); a similar kernel function is defined for the outputs, giving the corresponding kernel matrices. These kernel matrices define a generalized eigenvalue problem whose coefficients satisfy the stated normalization constraints. Solving it yields the generalized eigenvalues, and the eigenvectors associated with the k largest eigenvalues, scaled by their eigenvalues, define the mapping functions. Finally, the k-dimensional projection with the largest eigenvalues is selected as the reduced representation.
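The exact MLSI equations do not survive in the extracted text, so the following is only a sketch of the general recipe: build a kernel matrix over the inputs, combine it with the label structure, and keep the top-k generalized eigenvectors. The objective weighting `beta` and the ridge term `eps` are illustrative assumptions, not the paper's parameters.

```python
import numpy as np
from scipy.linalg import eigh

def mlsi_fit(X, Y, k, beta=0.5, eps=1e-6):
    """Sketch of a linear-kernel MLSI-style reduction: keep k projections
    balancing input reconstruction (weight 1-beta) against label
    structure (weight beta) via a generalized eigenproblem."""
    K = X @ X.T                                   # linear kernel matrix
    A = (1 - beta) * K @ K + beta * K @ Y @ Y.T @ K
    B = K + eps * np.eye(len(K))                  # regularized for stability
    w, V = eigh(A, B)                             # generalized eigenproblem
    return V[:, np.argsort(w)[::-1][:k]]          # top-k dual coefficients

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 300))               # toy fused features
Y = rng.integers(0, 2, (120, 4)).astype(float)    # toy multi-label matrix
V = mlsi_fit(X, Y, k=80)
Z = X @ X.T @ V                                   # reduced representation
print(Z.shape)
```

The dual (kernel-space) formulation is used here so that k can be as large as the number of samples, matching the 80-dimensional subset reported in Section 3.3.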

2.4 ML learning with feature induced labeling information enrichment

Let the training set be denoted {(x_i, Y_i)}, i = 1, …, n, where n is the number of training samples. Given the enriched labeling information, the original training set is transformed so that each sample carries numerical response variables, which are fitted through multi-output regression. The MLFE algorithm (Zhang) obtains the regression model by minimizing a regularized objective over the weight matrix and bias vector of the regression model, where q is the number of class labels. To optimize this objective, the iteratively reweighted least squares (IRWLS) method (Sanchez-Fernández; Tsoumakas) is used: in each iteration, the descent direction of the model optimization is determined by solving a linear system. Let Θ(t) denote the current model after the tth iteration; a first-order Taylor expansion around Θ(t) linearizes the objective, whose terms can be calculated under the current model. To obtain an analytical solution for the descent direction, a quadratic approximation of the objective is constructed, up to a constant term.
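The multi-output regression step above can be sketched in closed form. Note this is a simplified stand-in: the full MLFE algorithm minimizes its own objective with IRWLS, whereas this hypothetical sketch fits a multi-output ridge regression from features to enriched numerical labels.

```python
import numpy as np

def fit_multioutput_ridge(X, U, lam=1.0):
    """Fit weight matrix W and bias b minimizing
    ||X W + b - U||^2 + lam * ||W||^2 in closed form (simplified
    stand-in for the IRWLS-optimized MLFE regression model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append bias column
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0                               # do not penalize the bias
    Wb = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ U) # normal equations
    return Wb[:-1], Wb[-1]                          # W (d x q), b (q,)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 80))                    # reduced features
U = rng.standard_normal((50, 4))                     # enriched numerical labels
W, b = fit_multioutput_ridge(X, U)
print(W.shape, b.shape)  # (80, 4) (4,)
```

In MLFE the targets U would come from the label enrichment step rather than being raw binary labels; here they are random placeholders.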

2.5 Performance evaluation

Cross-validation can avoid over-fitting to some extent. Common methods include K-fold cross-validation (Jia), leave-one-out cross-validation (LOOCV) (Yu), the self-consistency test (Bringi) and the independent test (Heeren and D'Agostino, 1987). Compared with other cross-validation methods, LOOCV is deterministic and makes full use of the samples (Cheng). Therefore, the LOOCV test is used in this article to evaluate the model, with OAA, OLA, Hamming loss (HL), coverage (CV), ranking loss (RL) and average precision (AP) as indicators. OAA counts a protein as correct only when its predicted label set exactly matches the true label set, while OLA measures the fraction of true labels that are correctly predicted. HL is the fraction of label slots predicted incorrectly, averaged over the n training samples and q labels. CV measures how far, on average, one must go down the ranked label list to cover all true labels of a sample. RL is the average fraction of label pairs in which an irrelevant label is ranked above a relevant one, where the irrelevant label set is the complement of the relevant set. AP is the average fraction of relevant labels ranked above each relevant label. For OAA, OLA and AP, larger values indicate better performance; for HL, CV and RL, smaller values are better. This study proposes a new method, ML-locMLFE, for predicting the SCL of ML proteins; the detailed workflow is displayed in Figure 1.
Fig. 1.

Flowchart of the ML-locMLFE prediction method. (i) Data preparation. Five datasets are obtained from the Swiss-Prot and UniProtKB databases, together with the corresponding protein sequences and real labels. (ii) Feature extraction. PseAAC, EBGW, MCD, RPT, EDT and GO are used to convert protein sequence information into digital information, and the six feature vectors are fused. (iii) Feature selection. The MLSI method identifies the most effective information to form the optimal feature subset. (iv) Model construction. The optimal feature subset from step (iii) is fed into the MLFE classifier, and the ML-locMLFE model is constructed based on LOOCV. (v) Model evaluation. The Gram-positive bacteria dataset is used to evaluate the effectiveness of ML-locMLFE, and the Gram-negative bacteria, virus, SARS-CoV-2 and newPlant datasets are used to verify its performance. Both the training set and the test sets use OAA, OLA, HL, RL, AP and CV as evaluation indicators

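Three of the simpler indicators from Section 2.5 can be made concrete with a small numpy sketch. The OAA (exact-match) and HL definitions below are standard; the OLA formula (fraction of true labels recovered) is an assumed reading of the overall locative accuracy used in this literature.

```python
import numpy as np

def oaa(y_true, y_pred):
    """Overall actual accuracy: a protein counts only if all labels match."""
    return np.mean(np.all(y_true == y_pred, axis=1))

def ola(y_true, y_pred):
    """Overall locative accuracy (assumed definition): fraction of the
    true labels that are also predicted."""
    return np.sum(y_true * y_pred) / np.sum(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly."""
    return np.mean(y_true != y_pred)

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [1, 1, 0]])
print(oaa(y_true, y_pred))           # 0.5: only the first protein matches exactly
print(ola(y_true, y_pred))           # 1.0: all three true labels are predicted
print(hamming_loss(y_true, y_pred))  # ~0.167: 1 wrong slot out of 6
```

Coverage, ranking loss and average precision additionally require real-valued label scores rather than the binary predictions used here.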

3 Results

3.1 Result analysis of the feature encoding parameters λ and J

PseAAC and EBGW yield different characteristic information under different parameter settings. Since the minimum length of all protein sequences in the Gram-positive bacteria dataset is 55, the parameter λ of PseAAC is set from 5 to 54 and the parameter J of EBGW is set from 5 to 55. Through the LOOCV test, the characteristic information obtained for each parameter value is put into the MLFE classifier, and the specific evaluation index values for the different parameters are listed in Supplementary Tables S6 and S7. The optimal OAA obtained from PseAAC and EBGW are 62.44% and 58.96%, respectively. The comparison results under different parameters are shown in Figure 2.
Fig. 2.

The OAA obtained on the Gram-positive bacteria dataset under different parameters. (A) The OAA reaches its maximum at λ = 25; therefore, λ = 25 is selected as the optimal parameter of PseAAC, forming a 45-dimensional vector. (B) The OAA reaches its maximum at J = 40; therefore, J = 40 is selected as the best parameter of EBGW, forming a 120-dimensional vector


3.2 The influence of feature extraction methods on results

This article uses a total of six feature extraction methods. Among them, the PseAAC method considers not only the sequence information of the protein but also the position information of the amino acids in the sequence. The EBGW method is based on the physical and chemical properties of amino acids and effectively extracts the physicochemical information of proteins. The MCD method uses multiple regions as features to extract the physicochemical information of protein sequences. The GO method extracts the annotation information of the protein, which can essentially characterize the properties of genes and gene products. Because the EDT method considers the evolutionary information of the protein, it can reflect the occurrence probability of pairs of different amino acids. The RPT method obtains the evolutionary information of the protein by grouping the evolution scores in the PSSM. Therefore, the six feature extraction methods obtain effective information from different characteristics of the protein, which greatly improves the prediction performance of the model. Through the LOOCV test, the six single feature vectors are put into MLFE; GO has the largest contribution rate among all single features, with OAA and OLA reaching 91.91% and 93.31%, respectively. However, single feature information cannot represent all the important information, so the six feature extraction results need to be fused. We extract 912-dimensional feature vectors from GO, 45-dimensional from PseAAC, 120-dimensional from EBGW, 400-dimensional each from RPT and EDT, and 630-dimensional from MCD. After the final fusion, 2507-dimensional feature vectors are obtained. Through the LOOCV test, the comparison results of single and fused features are given in Figure 3.
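The fusion step above is plain concatenation of the per-method feature blocks; a minimal sketch with random placeholder features (the block dimensions are the ones reported in the text, the data is not real):

```python
import numpy as np

n = 10  # illustrative number of proteins
# Per-method feature dimensions reported in the text
blocks = {"GO": 912, "PseAAC": 45, "EBGW": 120,
          "RPT": 400, "EDT": 400, "MCD": 630}

rng = np.random.default_rng(2)
features = [rng.standard_normal((n, d)) for d in blocks.values()]
fused = np.hstack(features)   # concatenation-based multi-information fusion
print(fused.shape)            # (10, 2507)
```

The 2507-dimensional fused space is what MLSI subsequently reduces to 80 dimensions.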
Fig. 3.

Comparison of results based on seven different methods for Gram-positive bacteria. ALL: PseAAC+EBGW+EDT+RPT+GO+MCD. Among the six single feature extraction methods, the GO method has the greatest contribution rate to the model. For the fused features, the OAA and OLA are lower than for GO alone because of the increase of redundant information in the fused feature space. But compared with the other five single features, the OAA of the fusion is 26.40–39.89% higher and the OLA is 25.81–40.15% higher. Therefore, the fused features can represent the overall characteristics of the protein and improve the accuracy of the model prediction


3.3 Analysis of feature selection results

Feature selection can reduce the spatial dimension and decrease model training time. Therefore, this article uses principal component analysis (PCA) (Abdi and Williams, 2010), GRRO (Zhang), MDFS (Zhang), MDDM (Zhang and Zhou, 2010), MVMD (Xu) and MLSI (Yu) to eliminate irrelevant features. Through the LOOCV test, the feature subset obtained by each method is put into MLFE. The OAA of MLSI reaches 99.23% and the OLA reaches 99.81%, both optimal. The algorithm not only retains the original input features, but also captures the correlation of the output dimensions, which greatly improves prediction performance. On the Gram-positive bacteria dataset, the prediction results obtained by MLSI with different selected dimensions are shown in Supplementary Table S8, and the comparison results of the different methods can be found in Figure 4.
Fig. 4.

Comparison results based on different dimension reduction methods. When the 80-dimensional feature subset is obtained by MLSI, the OAA and OLA both reach their highest values. This method uses the linear correlation between input and output information to select the feature subset, which greatly improves the prediction ability. At the same time, MLSI increases the OAA by 7.52–37.77% and the OLA by 6.69–35.95% compared with the other methods. Therefore, MLSI is chosen as the feature selection method


3.4 Comparative analysis of classification algorithms

To verify the effectiveness of MLFE, we compare it with five classifiers: ML k-nearest neighbor (ML-KNN) (Gonzalez-Lopez), ML radial basis function (ML-RBF) (Zhang, 2009), ML learning with label-specific features (LIFT) (Zhang and Wu, 2015), ranking SVM (Rank-SVM) (Tayal) and ML learning by instance differentiation (INSDIF) (Zhang). The optimal feature subset obtained by MLSI is put into the six classifiers. Through the LOOCV test, the OAA and OLA obtained with the MLFE classifier are 99.23% and 99.81%, respectively. The algorithm uses the sparse reconstruction information among the training samples as features and passes this reconstruction information into the label space to enrich the original labels into numerical labels, thereby enhancing the effectiveness of the label information. The comparison results of the different methods are shown in Figure 5, and the corresponding receiver operating characteristic (ROC) and precision-recall (PR) curves are shown in Figure 6. The specific parameter values obtained for the different algorithms are shown in Supplementary Table S9.
Fig. 5.

Prediction results of six classifiers on Gram-positive bacteria dataset. The MLFE classifier is used to predict the multi-label protein SCL, the OAA and OLA are both highest and above 99%. The algorithm uses sparse reconstruction of the training samples to represent the bottom layer of the feature space. At the same time, the OAA of MLFE is 5.01–11.56% higher than the other five classifiers, and the OLA is 4.78–10.14% higher than the other five classifiers. In summary, MLFE can effectively link feature information with label information, which improves the prediction performance of the model

Fig. 6.

Comparison of ROC and PR curves of six classifiers. (A) The ROC curve of the Gram-positive bacteria dataset corresponding to the six classifiers. (B) The PR curves of the Gram-positive bacteria dataset corresponding to the six classifiers. The ROC and PR curves are usually used to evaluate the quality of the model. The closer the ROC curve is to the upper left corner, the higher the accuracy of the model. Conversely, the closer the PR curve is to the upper right corner, the higher the accuracy of the model. The area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR) of MLFE are both optimal. The AUC of MLFE is 99.75%, which is 3.09–7.15% higher than the AUC of the other five classifiers, and the AUPR is 98.13%, which is 1.29–10.30% higher than the other five classifiers


3.5 Comparison with other methods

With the continuous development of ML protein SCL research, many researchers use machine learning methods for prediction. To demonstrate the superiority of ML-locMLFE, we compare the results on four datasets with other methods. On the Gram-positive dataset, the results of this article are compared with those of iLoc-Gpos (Wu), Gpos-ECC-mPLoc (Wang) and Gram-LocEN (Wan); the results of the different methods are listed in Figure 7. On the Gram-negative dataset, the results are compared with those of iLoc-Gneg (Chou and Shen, 2006), Gneg-ECC-mPLoc (Wang) and Gram-LocEN (Wan). On the virus dataset, the results are compared with those of mGOASVM (Wan), AD-SVM (Wan and Mak, 2018) and mPLR-Loc (Wan). On the newPlant dataset, the results are compared with those of Plant-mPLoc (Chou and Shen, 2010), mPLR-Loc (Wan) and HybridGO-Loc (Wan). The comparison on the newPlant dataset is shown in Supplementary Table S10, and the results of the different methods on the other datasets are listed in Supplementary Figures S1 and S2.
Fig. 7.

On the Gram-positive bacteria dataset, ML-locMLFE is compared with other methods by the LOOCV test. The OAA and OLA are 99.23% and 99.81%, which are 2.93–6.33% and 3.01–6.71% higher than the other methods, respectively. In addition, the prediction results for the four subcellular locations are 99.42%, 100.00%, 100.00% and 100.00%, which are 1.72–3.42%, 5.60–33.33%, 2.84–4.80% and 4.88–10.57% higher than the other methods, respectively. Therefore, ML-locMLFE is superior to the other methods on the same dataset


3.6 Prediction ML protein SCL of SARS-CoV-2

Since SARS-CoV-2 has had a great impact on human health, it is important to locate the subcellular locations of SARS-CoV-2 proteins accurately and quickly. Many researchers have sought treatments for COVID-19 by analyzing the pathogenesis of SARS-CoV-2. Hoffmann et al. found that SARS-CoV-2 entry depends on the transmembrane protease serine 2 (TMPRSS2) and that protease inhibitors of TMPRSS2 can block SARS-CoV-2 entry into cells. Xu et al. predicted that TMPRSS2 can bind some monomeric compounds independently by studying the protein properties and 3D structure of TMPRSS2. These studies show that TMPRSS2 is a serine protease anchored to the cell membrane through its amino-terminal transmembrane region, and that its inhibitors are candidates for the treatment of COVID-19. The SARS-CoV-2 protein information is obtained by PseAAC, EBGW, GO, RPT, EDT and MCD to form the original feature space. Because the interference of redundant information grows with dimensionality, we use MLSI to obtain the optimal feature subset. Using the MLFE algorithm, the OAA is 72.73% and the OLA is 69.23%. The specific result comparison is shown in Table 2.
Table 2.

Prediction of protein SCL using ML-locMLFE on SARS-CoV-2 dataset

Location                    ML-locMLFE
Plasma membrane             10/10 = 1.0000
Cytoskeleton                0/1 = 0.0000
Endoplasmic reticulum       0/1 = 0.0000
Endosome                    0/2 = 0.0000
Golgi apparatus             0/1 = 0.0000
Lysosome                    0/2 = 0.0000
Mitochondrion               0/2 = 0.0000
Nucleus                     7/7 = 1.0000
Table 2 shows that SARS-CoV-2 proteins mainly exist in the plasma membrane, nucleus, Golgi apparatus and other subcellular locations. We obtained 26 proteins from the UniProt database and found that the ninth protein is TMPRSS2, which is located on the plasma membrane and is accurately predicted by ML-locMLFE. Because the SARS-CoV-2 dataset is too small to optimize the model, the stability of the model is relatively low; the prediction results for Cytoskeleton, Endoplasmic reticulum, Endosome, Golgi apparatus, Lysosome and Mitochondrion are therefore not ideal, but the OAA and OLA of ML-locMLFE still reach 72.73% and 69.23%. The model presented in this article can not only predict the SCL of important SARS-CoV-2 proteins quickly and accurately, but also provide a theoretical basis for the treatment of SARS-CoV-2 pneumonia and for drug research.
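The per-location ratios in Table 2 are, for each location, the number of correctly assigned proteins over the number of proteins truly annotated there. A minimal sketch of that tally, on toy data rather than the real 26-protein dataset:

```python
import numpy as np

def per_location_accuracy(Y_true, Y_pred, names):
    """For each location, count proteins truly annotated there that
    the predictor also assigns there, as in the Table 2 ratios."""
    out = {}
    for j, name in enumerate(names):
        mask = Y_true[:, j] == 1          # proteins truly at location j
        hits = int((Y_pred[mask, j] == 1).sum())
        out[name] = (hits, int(mask.sum()))
    return out

# illustrative toy data: 4 proteins, 2 locations (not the real dataset)
names = ["Plasma membrane", "Nucleus"]
Y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
Y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
print(per_location_accuracy(Y_true, Y_pred, names))
# {'Plasma membrane': (2, 3), 'Nucleus': (1, 2)}
```

Note that locations with only one or two true proteins, as in Table 2, make these ratios very sensitive: a single miss drops the score to zero.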

4 Conclusion

Using machine learning methods to predict the SCL of ML proteins is significant for understanding protein structure and function. First, PseAAC, EBGW, GO, RPT, EDT and MCD are used to extract important information about various properties of proteins. Among them, the GO method yields the highest prediction accuracy and the largest contribution compared with the other five methods; the annotation information of genes and gene products extracted by GO provides important evidence for the study of protein function. Second, this is the first time MLSI is used for feature selection in ML protein SCL prediction. MLSI maps the input features into a new feature space that preserves the input information while capturing the correlations among the output labels, so it selects the optimal feature subset more effectively. Finally, we feed the optimal feature subset into MLFE. For the first time, the MLFE algorithm is used to enrich the original labels of the training samples into numerical labels, enhancing the effectiveness of the ML information and further improving the performance of the model. By the LOOCV test, the OAA values on the Gram-positive bacteria, Gram-negative bacteria, virus, SARS-CoV-2 and newPlant datasets are 99.23%, 93.82%, 93.24%, 72.73% and 96.72%, and the OLA values are 99.81%, 96.50%, 99.21%, 69.23% and 96.25%, respectively. Therefore, the ML-locMLFE method proposed in this article predicts ML protein SCL more accurately. In addition, the ML-locMLFE model can be extended to other research fields such as ML protein post-translational modification, ML mRNA SCL and the identification of drug-target interactions. More importantly, the model can accurately predict the SCL of SARS-CoV-2 proteins and thus help clarify the pathogenic mechanism of the virus.
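The key property claimed for MLSI is that the projection is informed by both the feature matrix X and the label matrix Y. The sketch below is only a rough numerical illustration of that idea, balancing feature variance against feature-label covariance with a weight beta; the published MLSI method solves a different optimization problem, so this should be read as illustrative, not as the paper's algorithm.

```python
import numpy as np

def label_informed_reduce(X, Y, k, beta=0.5):
    """Rough sketch of label-informed dimensionality reduction:
    pick k directions from a blend of the feature scatter (X'X) and
    the feature-label covariance term (X'Y Y'X). M is symmetric,
    so eigh applies; eigenvalues come back in ascending order."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    M = (1 - beta) * (Xc.T @ Xc) + beta * (Xc.T @ Yc) @ (Yc.T @ Xc)
    _, vecs = np.linalg.eigh(M)
    W = vecs[:, ::-1][:, :k]   # top-k directions
    return X @ W               # reduced feature matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))           # 30 proteins, 8 fused features
Y = rng.integers(0, 2, size=(30, 3))   # 3 binary labels
Z = label_informed_reduce(X, Y, k=4)
print(Z.shape)  # (30, 4)
```

Setting beta to 0 recovers plain PCA-style reduction; increasing beta pulls the projection toward directions that correlate with the labels, which is the intuition behind using MLSI rather than an unsupervised reducer here.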
We hope that our method can provide some insights and help in the clinical treatment of various diseases, including COVID-19. In the next step, we will construct larger-scale and more diverse datasets to study the SCL of ML proteins.