Literature DB >> 31045538

Predicting subcellular localization of multisite proteins using differently weighted multi-label k-nearest neighbors sets.

Zhongting Jiang1, Dong Wang1,2,3, Peng Wu1, Yuehui Chen1, Huijie Shang1, Luyao Wang1, Huichun Xie3.   

Abstract

BACKGROUND: For a protein to execute its function, ensuring its correct subcellular localization is essential. In addition to biological experiments, bioinformatics is widely used to predict and determine the subcellular localization of proteins. However, single-feature extraction methods cannot effectively handle the huge amount of data and multisite localization of proteins. Thus, we developed a pseudo amino acid composition (PseAAC) method and an entropy density technique to extract feature fusion information from subcellular multisite proteins.
OBJECTIVE: To predict multiplex protein subcellular localization and achieve high prediction accuracy.
METHOD: To improve the efficiency of predicting multiplex protein subcellular localization, we used the multi-label k-nearest neighbors algorithm and assigned different weights to various attributes. The method was evaluated using several performance metrics with a dataset consisting of protein sequences with single-site and multisite subcellular localizations.
RESULTS: Evaluation experiments showed that the proposed method significantly improves the optimal overall accuracy rate of multiplex protein subcellular localization.
CONCLUSION: This method can help to more comprehensively predict protein subcellular localization toward better understanding protein function, thereby bridging the gap between theory and application toward improved identification and monitoring of drug targets.


Keywords:  Pseudo amino acid composition (PseAAC); entropy density; multi-label k-nearest neighbors (ML-KNN); subcellular localization of multisite proteins; wML-KNN

Year:  2019        PMID: 31045538      PMCID: PMC6598103          DOI: 10.3233/THC-199018

Source DB:  PubMed          Journal:  Technol Health Care        ISSN: 0928-7329            Impact factor:   1.285


Introduction

Proteins are more complex and diverse than the DNA sequences that encode them. As protein subcellular localization (PSL) is highly correlated with protein function, effective PSL prediction methods provide information for understanding protein functions that is essential for several research applications, such as molecular biology [1, 2], cell biology [3, 4], and neuroscience [5], as well as clinical medicine [6, 7, 8]. The prediction and recognition of PSL using bioinformatics methods is one of the fundamental objectives of proteomics and cell biology [9], as PSL information is crucial for discovering protein functions and thereby understanding the complex cellular pathways involved in biological regulation.

In recent decades, high-throughput molecular biology technologies have brought about exponential growth in the number of identified protein sequences. Considering the vast amount of molecular data now available and continuously accumulating, it is vital to develop a fast and reliable method for identifying the subcellular locations of proteins from their sequence information [10, 11, 12, 13]. The basic requirement for a cell to function normally is that its proteins operate in specific subcellular positions [14]. The main limitation of current PSL prediction methods is that they cannot handle proteins with multiple localization sites simultaneously. Thus, new prediction methods must be able to predict both single-site and multisite PSLs. Since a given feature extraction method and dataset will yield different results depending on the prediction algorithm applied, we here propose a new method for extracting protein subcellular information using a feature fusion technique. Feature extraction methods including the entropy density and pseudo amino acid composition (PseAAC) are typically used for protein coding.
Considering these two feature extraction methods, we used the multi-label k-nearest neighbors (ML-KNN) algorithm and assigned diverse weights to various attributes for predicting multiplex PSL. We then evaluated the developed algorithms using a variety of well-established performance indexes.

Materials and method

Dataset

The dataset used in the present study was originally constructed for the Gpos-mPLoc predictor, as described previously [13]. It consists of 519 different protein sequences, each with one or more subcellular locations: four have two subcellular localizations and the remaining 515 have only one. No protein shows greater than 25% sequence identity with any other protein in the dataset. The dataset was downloaded from http://www.csbio.sjtu.edu.cn/bioinf/Gpos-multi/Data.htm (last accessed on April 30, 2018), and includes 174 protein sequences in the cell membrane, 18 in the cell wall, 208 in the cytoplasm, and 123 extracellular protein sequences.

Feature extraction

Entropy density

The entropy density was first introduced to represent DNA sequences and determine exons existing in gene sequences [15]. The Shannon entropy [16] is expressed as:

H = − Σ_{i=1}^{20} p_i log p_i    (1)

where p_i (i = 1, 2, …, 20) represent the normalized occurrence frequencies of the 20 amino acids that occur in protein P. Therefore, the entropy density function is defined as:

d_i = − p_i log p_i    (2)

The sequence of protein P is then represented as:

P = [d_1, d_2, …, d_20]^T    (3)

where d_1, …, d_20 are the entropy density values of the 20 amino acids in the protein sequence.
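As a concrete illustration, the entropy density encoding above can be sketched in a few lines of Python (a minimal sketch under our reading of the method; the function name and the choice of log base 2 are assumptions, not the authors' code):

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def entropy_density(sequence: str) -> list:
    """Encode a protein as the 20 entropy density values d_i = -p_i * log2(p_i),
    where p_i is the normalized frequency of the i-th amino acid."""
    counts = Counter(sequence)
    n = len(sequence)
    features = []
    for aa in AMINO_ACIDS:
        p = counts.get(aa, 0) / n
        # Convention: an absent amino acid (p = 0) contributes 0.
        features.append(-p * log2(p) if p > 0 else 0.0)
    return features

print(len(entropy_density("MKVLAAGLLKV")))  # 20-dimensional feature vector
```

Each sequence thus maps to a fixed 20-dimensional vector regardless of its length, which is what allows sequences of different lengths to be compared by a k-nearest-neighbors classifier.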

PseAAC

PseAAC was first developed to prevent the loss of hidden information in protein sequences and to provide a better expression of the initial information about AACs [17, 18, 19]. PseAAC reflects a protein expressed by generating more than 20 discrete numbers, and includes both the main features of the AAC and information beyond the AAC. Suppose the protein molecule P of length L is represented by:

P = R_1 R_2 R_3 … R_L    (4)

where R_i indicates the amino acid residue located at position i. The protein sequence and other relevant information are also included in this expression. From the inherent nature of this representation, we can approximate the amino acid order information within a peptide chain sequence as a set of sequence order correlation factors, which can be expressed as:

θ_λ = (1 / (L − λ)) Σ_{i=1}^{L−λ} Θ(R_i, R_{i+λ}),  λ < L    (5)

Θ(R_i, R_j) = (1/3) { [H_1(R_j) − H_1(R_i)]^2 + [H_2(R_j) − H_2(R_i)]^2 + [M(R_j) − M(R_i)]^2 }    (6)

where H_1(R_i), H_2(R_i), and M(R_i) denote the hydrophobicity, hydrophilicity, and amino acid side-chain mass of R_i (i = 1, 2, …, L), respectively. These values should first be transformed so that their average over the coding sequence of the 20 amino acids is zero, and then substituted into Eq. (6). The transformations can be expressed as follows:

H_1(i) = [H_1^0(i) − ⟨H_1^0⟩] / SD(H_1^0), and analogously for H_2(i) and M(i)    (7)

where H_1^0(i) is the original hydrophobicity value of the i-th amino acid, ⟨·⟩ denotes the average over the 20 amino acids, and SD(·) their standard deviation. To the classic 20-dimensional AAC, the λ sequence order-correlated factors in Eq. (5) are added to produce a PseAAC containing (20 + λ) units. That is, we can express the characterization of a protein sample as:

x_u = f_u / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} θ_j),  1 ≤ u ≤ 20    (8)

x_u = w θ_{u−20} / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} θ_j),  20 + 1 ≤ u ≤ 20 + λ    (9)

where f_u (u = 1, 2, …, 20) indicate the normalized occurrence frequencies of the 20 amino acids in protein P, θ_j is the j-tier sequence-correlation factor computed from Eqs (5)–(7), and w is the weight of the sequence order effect. In Eqs (8) and (9), units 1–20 reflect the influence of the classical AAC, and the remaining λ units reflect the influence of the sequence order. The input of the algorithm combines these two feature extraction methods, which involves the simple addition of dimensions.
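The PseAAC construction can be sketched as follows (a simplified illustration: for brevity it uses a single physicochemical property, the Kyte-Doolittle hydrophobicity scale, whereas the paper combines hydrophobicity, hydrophilicity, and side-chain mass; the function name and default parameters are hypothetical):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydrophobicity, one illustrative property.
KD = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
      "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
      "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
      "W": -0.9, "Y": -1.3}

def pseaac(seq, props=(KD,), lam=3, w=0.05):
    """Build a (20 + lam)-dimensional PseAAC vector for a protein sequence."""
    # Standardize each property to zero mean, unit SD over the 20 amino acids.
    std_props = []
    for p in props:
        v = np.array([p[a] for a in AA])
        std_props.append(dict(zip(AA, (v - v.mean()) / v.std())))

    def theta_fn(a, b):
        # Correlation function: mean squared property difference.
        return np.mean([(p[b] - p[a]) ** 2 for p in std_props])

    L = len(seq)
    # Tier-l sequence order correlation factors.
    thetas = [np.mean([theta_fn(seq[i], seq[i + l]) for i in range(L - l)])
              for l in range(1, lam + 1)]
    freqs = np.array([seq.count(a) / L for a in AA])
    denom = freqs.sum() + w * sum(thetas)
    # First 20 units: weighted AAC; remaining lam units: sequence order terms.
    return np.concatenate([freqs / denom, w * np.array(thetas) / denom])

x = pseaac("MKVLAAGLLKV", lam=3)
print(x.shape)  # (23,) = 20 + lambda components
```

Because the denominator normalizes both parts jointly, the resulting vector sums to one, mirroring the normalization in the equations above.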

Algorithm

To guarantee the accuracy of PSL prediction, a machine-learning algorithm must first be carefully designed. Although there are many algorithms available for predicting PSL, most of these focus only on a single subcellular location per protein sequence and cannot handle proteins located at multiple sites. With the increasing discovery of multisite proteins, it is vital to obtain a fast and convenient method to recognize the subcellular locations of proteins that do not have specific features. Here, we explored the use of the ML-KNN algorithm for this purpose, which is known to be beneficial for multi-label learning problems [20, 21, 22, 23]. For a test case t, ML-KNN first identifies the set N(t) of its k nearest neighbors within the training set. For each label l, y_t(l) = 1 signifies that t has label l, whereas y_t(l) = 0 signifies that t does not have label l. H_1^l denotes the event that t has label l, H_0^l the event that it does not, and E_j^l denotes the event that exactly j cases among the k nearest neighbors of t have label l. Let y_a be the class vector of a training case a: if a has label l, the l-th unit y_a(l) is 1; otherwise, it is 0. Therefore, according to the label sets of the neighbors, the membership counting vector can be expressed as:

C_t(l) = Σ_{a ∈ N(t)} y_a(l)

where C_t(l) represents the number of neighbors of t that carry the l-th label. First, the prior probabilities of all labels are estimated from the m training samples (with smoothing parameter s):

P(H_1^l) = (s + Σ_{i=1}^{m} y_i(l)) / (s × 2 + m),  P(H_0^l) = 1 − P(H_1^l)

Next, the Bayesian rule is used to compute the posterior probabilities:

P(E_j^l | H_1^l) = (s + c[j]) / (s × (k + 1) + Σ_{r=0}^{k} c[r])

P(E_j^l | H_0^l) = (s + c′[j]) / (s × (k + 1) + Σ_{r=0}^{k} c′[r])

where c[j] is the number of training instances with label l that have exactly j samples labeled l among their k nearest neighbors; c′[j] is similar to c[j], except that it counts instances without label l. Finally, we compute the maximum posterior probability using the following two equations:

y_t(l) = arg max_{b ∈ {0, 1}} P(H_b^l) P(E_{C_t(l)}^l | H_b^l)

r_t(l) = P(H_1^l) P(E_{C_t(l)}^l | H_1^l) / [ P(H_1^l) P(E_{C_t(l)}^l | H_1^l) + P(H_0^l) P(E_{C_t(l)}^l | H_0^l) ]

However, uneven class sizes in the training set lead to relatively low accuracy. Thus, we developed an improved algorithm that assigns a suitable weight to each class [24]. The weighting factors are defined as:

w_l = AvgNum / Num(l)
where AvgNum represents the average number of samples over the different classes and Num(l) represents the number of samples in class l. This novel ML-KNN method with weighted prior probabilities was named wML-KNN. A flowchart of the prediction algorithm is shown in Fig. 1.
Figure 1.

Flowchart of the prediction algorithm.
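The prediction pipeline can be approximated with the short sketch below. This is our reading of ML-KNN with the class-size weight applied to the prior; the class name, the Euclidean distance, and the exact placement of the weight are assumptions, not the authors' implementation:

```python
import numpy as np

class WMLKNN:
    """Sketch of ML-KNN with an optional class-size weight on the prior
    (our reading of wML-KNN; hypothetical API)."""

    def __init__(self, k=5, s=1.0, weighted=True):
        self.k, self.s, self.weighted = k, s, weighted

    def fit(self, X, Y):
        X, Y = np.asarray(X, float), np.asarray(Y, int)
        m, q = Y.shape
        k, s = self.k, self.s
        num = Y.sum(axis=0)                      # samples per label, Num(l)
        w = num.mean() / np.maximum(num, 1) if self.weighted else np.ones(q)
        # Weighted, smoothed prior P(H_1^l), capped at 1.
        self.prior1 = np.minimum(w * (s + num) / (s * 2 + m), 1.0)
        # c1[j, l]: labeled instances with exactly j labeled neighbors; c0: unlabeled.
        c1 = np.zeros((k + 1, q)); c0 = np.zeros((k + 1, q))
        for i in range(m):
            d = np.linalg.norm(X - X[i], axis=1)
            nb = [j for j in np.argsort(d) if j != i][:k]  # exclude self
            cnt = Y[nb].sum(axis=0)                        # C_i(l)
            for l in range(q):
                (c1 if Y[i, l] else c0)[cnt[l], l] += 1
        self.post1 = (s + c1) / (s * (k + 1) + c1.sum(axis=0))
        self.post0 = (s + c0) / (s * (k + 1) + c0.sum(axis=0))
        self.X, self.Y = X, Y
        return self

    def predict(self, x):
        d = np.linalg.norm(self.X - np.asarray(x, float), axis=1)
        nb = np.argsort(d)[:self.k]
        cnt = self.Y[nb].sum(axis=0)
        q = self.Y.shape[1]
        p1 = np.array([self.prior1[l] * self.post1[cnt[l], l] for l in range(q)])
        p0 = np.array([(1 - self.prior1[l]) * self.post0[cnt[l], l] for l in range(q)])
        return (p1 > p0).astype(int)  # maximum a posteriori decision per label
```

On a toy two-cluster dataset this sketch recovers the cluster labels; with `weighted=False` it reduces to plain ML-KNN, so the effect of the weighting factor can be compared directly.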

Evaluation measures

The following multi-label evaluation indexes [25, 26] were applied to the multi-label test set: Hamming loss, one-error, coverage, average recall, and absolute-true, described below in turn.

Hamming Loss

Hamming loss is an indicator of the frequency with which an instance-label pair is misclassified, expressed as:

HLoss = (1/p) Σ_{i=1}^{p} (1/q) |h(x_i) Δ Y_i|

where p is the number of test samples, h(x_i) is the predicted label set, Y_i is the real label set, Δ is the symmetric difference between two sets, and q is the number of labels. Thus, performance is inversely proportional to the value of the Hamming loss.

One-error

The one-error metric evaluates the frequency with which the top-ranked label is not in the instance's set of appropriate labels, expressed as:

One-error = (1/p) Σ_{i=1}^{p} [[ arg max_{y} f(x_i, y) ∉ Y_i ]]

where f is a real-valued scoring function. For a case x_i and corresponding label set Y_i, a good learning process will output larger values of f(x_i, y) for labels in Y_i than for labels that are not in Y_i. Thus, performance is inversely proportional to the value of the one-error.

Coverage

The coverage evaluates how far, on average, one must move down the ranked label list to cover all appropriate labels of an instance. Thus, performance is inversely proportional to the value of the coverage.

Average recall

The average recall evaluates, for each appropriate label, the mean proportion of labels ranked above it that are also appropriate labels. Thus, performance is directly proportional to the value of the average recall.

Absolute-true

Absolute-true assesses the proportion of instances whose predicted label set is identical to the actual label set. Thus, performance is directly proportional to the value of the absolute-true rate.
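Under our reading of these definitions, four of the measures can be sketched as follows (hypothetical function names; `scores` are the real-valued outputs f(x, y), and `pred`/`true` are binarized label matrices):

```python
import numpy as np

def hamming_loss(pred, true):
    """Fraction of instance-label pairs that disagree (lower is better)."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.mean(pred != true))

def one_error(scores, true):
    """How often the top-scoring label is not a true label (lower is better)."""
    scores, true = np.asarray(scores), np.asarray(true)
    top = scores.argmax(axis=1)
    return float(np.mean([true[i, top[i]] == 0 for i in range(len(true))]))

def coverage(scores, true):
    """Average 0-based rank depth needed to reach the lowest-ranked true
    label of each instance (lower is better)."""
    out = []
    for s, y in zip(np.asarray(scores), np.asarray(true)):
        rank = (-s).argsort().argsort()  # rank 0 = highest score
        out.append(rank[y == 1].max())
    return float(np.mean(out))

def absolute_true(pred, true):
    """Fraction of instances whose predicted label set matches the actual
    label set exactly (higher is better)."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.mean(np.all(pred == true, axis=1)))
```

Hamming loss and absolute-true operate on the binarized predictions, whereas one-error and coverage operate on the ranking induced by the confidence scores, which is why both kinds of output are useful from a multi-label classifier.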

Results

The feature extraction algorithm affects the final accuracy. In this study, the effect of using the entropy density and PseAAC feature extraction methods was examined, and the ML-KNN and wML-KNN methods were both applied for comparison of results. The feature fusion method, with simple addition of dimensions, was also used. Overall, better results were obtained with the parameter set to 1. As shown in Tables 1 and 2, with both the entropy density and PseAAC feature extraction methods, wML-KNN showed better performance than ML-KNN on every evaluation measure.
Table 1

Results after applying entropy density to different algorithms

Evaluation criterion   ML-KNN    wML-KNN
Hamming loss (↓)       0.1783    0.0918
One-error (↓)          0.3576    0.1855
Coverage (↓)           0.6099    0.2333
Average recall (↑)     0.7844    0.9018

“↑” indicates that larger values provided better results, whereas “↓” indicates that smaller values provided better results.

Table 2

Results after applying PseAAC to different algorithms

Evaluation criterion   ML-KNN    wML-KNN
Hamming loss (↓)       0.1769    0.0798
One-error (↓)          0.3595    0.1587
Coverage (↓)           0.5946    0.2084
Average recall (↑)     0.7871    0.9149

“↑” indicates that larger values provided better results, whereas “↓” indicates that smaller values provided better results.

The comparative results presented in Tables 3 and 4 show that using wML-KNN increases the absolute-true rate by approximately 20 percentage points. Thus, the same feature extraction algorithm applied to the same dataset gives different results simply by including weight values within the ML-KNN algorithm. Similar improvements in the absolute-true rate were observed with both feature extraction algorithms, indicating that the wML-KNN algorithm provides better prediction accuracy.
Table 3

Comparative results of two different algorithms adopting the entropy density method

Algorithm       ML-KNN    wML-KNN
Absolute-true   62.72%    81.26%
Table 4

Comparative results of two different algorithms adopting PseAAC

Algorithm       ML-KNN    wML-KNN
Absolute-true   62.52%    83.37%

Discussion

Previous PSL prediction studies largely focused on proteins localized at a single site. Since the discovery that several proteins occupy two or more subcellular sites, more attention has been paid to multisite subcellular localization prediction and related algorithms. As a multi-label learning method with good generalization ability, the ML-KNN algorithm has achieved remarkable results in multisite subcellular localization prediction. Furthermore, combining various feature extraction methods with the wML-KNN prediction algorithm can achieve higher accuracy than reported in other studies. For example, using different fused feature extraction methods, Qu et al. [13] adopted the ML-KNN algorithm to predict multisite subcellular localization in the Gpos-mPLoc protein dataset and obtained a best overall accuracy of 66.1568%. Lin et al. [9] used ML-KNN as the prediction algorithm on a benchmark animal protein dataset and obtained an accuracy of 62.88%.

In practice, one dataset often contains proteins with both single-site and multisite subcellular localizations. This imbalance inevitably influences the efficiency of the ML-KNN algorithm. Therefore, as an improved version of ML-KNN, we developed the wML-KNN algorithm to partly reduce the negative effects of this data imbalance and to achieve relatively higher prediction accuracy.

The ultimate purpose of studying multisite PSL is to understand a protein's intrinsic function and obtain new insight into the nature of life. However, to date, the prediction of PSL has largely remained at the theoretical and experimental level. Despite the higher prediction accuracy obtained with our proposed method, more research will be needed to link the theory to practice in this field. The following points should be considered as a preliminary analysis.
First, more standardized protein datasets should be constructed to reduce data imbalance and improve the universality of related algorithms. Second, the entropy density and PseAAC are both effective local feature extraction methods; effective fusion of multiple features is thus expected to further improve prediction accuracy. Moreover, fusing global and local physicochemical features of proteins, beyond protein image information, may bring breakthroughs in this field. Finally, just as important as the improvement in prediction accuracy described herein, it will be necessary to derive a plausible biological explanation for the results, which can facilitate the application of PSL theory in practice and should be a primary focus of further work in this field.
  4 in total

1.  Effect of potassium channel noise on nerve discharge based on the Chay model.

Authors:  Zhongting Jiang; Dong Wang; Huijie Shang; Yuehui Chen
Journal:  Technol Health Care       Date:  2020       Impact factor: 1.285

2.  Multilocation proteins in organelle communication: Based on protein-protein interactions.

Authors:  Erhui Xiong; Di Cao; Chengxin Qu; Pengfei Zhao; Zhaokun Wu; Dongmei Yin; Quanzhi Zhao; Fangping Gong
Journal:  Plant Direct       Date:  2022-02-21

3.  Automatic classification of nerve discharge rhythms based on sparse auto-encoder and time series feature.

Authors:  Zhongting Jiang; Dong Wang; Yuehui Chen
Journal:  BMC Bioinformatics       Date:  2022-02-15       Impact factor: 3.169

4.  PSO-LocBact: A Consensus Method for Optimizing Multiple Classifier Results for Predicting the Subcellular Localization of Bacterial Proteins.

Authors:  Supatcha Lertampaiporn; Sirapop Nuannimnoi; Tayvich Vorapreeda; Nipa Chokesajjawatee; Wonnop Visessanguan; Chinae Thammarongtham
Journal:  Biomed Res Int       Date:  2019-11-19       Impact factor: 3.411

