
TOPPER: topology prediction of transmembrane protein based on evidential reasoning.

Xinyang Deng¹, Qi Liu, Yong Hu, Yong Deng.

Abstract

The topology prediction of transmembrane proteins is an active research field in bioinformatics and molecular biology, and it is a typical pattern recognition problem. Because experimental techniques are restricted by many stringent conditions, various algorithms have been developed to predict transmembrane protein topology. These individual algorithms usually depend on different principles, such as the hydrophobicity or charges of residues. In this paper, an evidential topology prediction method for transmembrane proteins is proposed, called TOPPER (topology prediction of transmembrane protein based on evidential reasoning). In the proposed method, the prediction results of multiple individual prediction algorithms are transformed into BPAs (basic probability assignments) according to the confusion matrix. The final prediction result is then obtained by combining the individual predictions based on Dempster's rule of combination. The experimental results show that the proposed method is superior to the individual prediction algorithms, which illustrates its effectiveness.

Substances: Membrane Proteins

Year: 2013 PMID: 23401665 PMCID: PMC3562663 DOI: 10.1155/2013/123731

Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X

1. Introduction

According to current genome data, roughly 20–30% of the genes in a typical organism code for α-helical transmembrane (TM) proteins [1-3]. Transmembrane proteins carry out most of the functions of the biomembrane and play many important roles in the cell, such as substance transport and energy conversion. To explore the structure, function, and transmembrane mechanism of these proteins, topology prediction has become an active field in bioinformatics and molecular biology [1, 2, 4]. The topology of a transmembrane protein [5], that is, the number and position of the transmembrane helices and the in/out location of the N and C termini of the protein sequence, is central to the study of transmembrane proteins. For a protein sequence, the topology is said to be predicted correctly if both the transmembrane helices and the locations of the N and C termini are predicted correctly. Recently, information science and technology have been widely used in biology and medicine [6-8]. In essence, the topology prediction of transmembrane proteins is a typical pattern recognition problem. As shown in Figure 1, given a protein sequence, the task is to assign each residue one of three class labels: “i” (intracellular), “M” (transmembrane), or “o” (extracellular). At present, the most accurate ways to determine the topology of a transmembrane protein are experimental techniques such as nuclear magnetic resonance (NMR) and X-ray crystal diffraction. However, these techniques usually require strict conditions, so they cannot be applied on a large scale and cannot keep pace with the growing number of protein sequences. Therefore, various computational methods have been developed to predict the topology of transmembrane proteins [9-11].

Figure 1

Topology prediction of transmembrane protein.

Previous studies offer three main kinds of algorithms for predicting the topology of transmembrane proteins. The first kind relies on the chemical or physical properties of amino acids, for example, the hydrophobicity of residues or the charges of residues in different locations; classical examples include TopPred [2] and others [12, 13]. The second kind is based on statistical analysis of a large number of transmembrane proteins of known structure, such as MEMSAT [14], TMAP [10], and PRED-TMR [15]. The third kind applies machine learning techniques such as the hidden Markov model (HMM) and the support vector machine (SVM); examples are HMMTOP [11], PHDhtm [16, 17], and others [18-21]. Although many algorithms exist, they depend on different principles and have different scopes of application. A prediction system that takes more information into consideration can be expected to predict better, which is essentially the viewpoint of ensemble learning [22-25]. Applying this idea to the topology prediction of transmembrane proteins, the individual prediction algorithms are treated as basic predictors, and the task is to combine them into a combined predictor that performs better than any basic predictor. This process raises two critical problems: how to represent each predictor's results and how to combine multiple predictors. Regarding representation, Xu et al. [23] pointed out that three types of output information can be utilized for different prediction algorithms, namely, information at the abstract level, the rank level, and the measurement level. As to the combination method, traditional approaches are usually built on the framework of probability theory, which is very effective for randomness. However, the real world involves various kinds of uncertainty, including not only randomness but also fuzziness and incompleteness [26, 27]. As a theory of evidential reasoning under uncertainty, the Dempster-Shafer theory of evidence [28, 29] can express various uncertainties directly and has been widely used in many fields [30-37]. It provides a general and effective framework for representing and combining multiple individual algorithms.

In this paper, a new topology prediction method for transmembrane proteins based on the evidential reasoning approach, called TOPPER, is proposed. In TOPPER, the prediction results of each basic predictor are represented by a basic probability assignment (BPA) constructed from the confusion matrix of the predictor. The basic predictors are then combined by using Dempster's rule of combination. Finally, the topology of a transmembrane protein sequence is determined according to the combined prediction results. An experiment demonstrates the effectiveness of the proposed prediction method. The rest of this paper is organized as follows. Section 2 introduces some basic concepts of the Dempster-Shafer theory of evidence. Section 3 presents the proposed method. Section 4 gives experimental verification to demonstrate the effectiveness of the proposed method. Conclusions are given in Section 5.

2. Preliminaries

In this section, a few concepts commonly used in the Dempster-Shafer theory of evidence are introduced. The Dempster-Shafer theory of evidence [28, 29], also called the Dempster-Shafer theory or evidence theory, is used to deal with uncertain information. As an effective theory of evidential reasoning, it has the advantage of expressing various uncertainties directly. The theory requires weaker conditions than the Bayesian theory of probability, so it is often regarded as an extension of the Bayesian theory. For completeness, a few basic concepts are introduced as follows.

Definition 1

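Let Ω be a set of mutually exclusive and collectively exhaustive events, indicated by

$$\Omega = \{E_1, E_2, \ldots, E_i, \ldots, E_N\}.$$

The set Ω is called the frame of discernment. The power set of Ω is indicated by $2^{\Omega}$, where

$$2^{\Omega} = \{\emptyset, \{E_1\}, \ldots, \{E_N\}, \{E_1, E_2\}, \ldots, \{E_1, E_2, \ldots, E_i\}, \ldots, \Omega\}.$$

If $A \in 2^{\Omega}$, A is called a proposition.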

Definition 2

For a frame of discernment Ω, a mass function is a mapping m from $2^{\Omega}$ to [0,1], formally defined by

$$m : 2^{\Omega} \to [0, 1],$$

which satisfies the following condition:

$$m(\emptyset) = 0 \quad \text{and} \quad \sum_{A \in 2^{\Omega}} m(A) = 1.$$

In the Dempster-Shafer theory, a mass function is also called a basic probability assignment (BPA). If m(A) > 0, A is called a focal element; the union of all focal elements is called the core of the mass function.
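As an illustration of Definitions 1 and 2, the following minimal sketch represents a BPA as a Python dictionary from `frozenset` propositions to masses and checks the two conditions on a mass function. The names `power_set` and `is_valid_bpa`, the tolerance, and the example masses are assumptions made for this sketch; they do not come from the paper.

```python
from itertools import combinations

def power_set(frame):
    """Return every subset of `frame` as a frozenset, including the empty set."""
    items = sorted(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_valid_bpa(m, frame, tol=1e-9):
    """Check the conditions of Definition 2 for a mass function `m`."""
    subsets = set(power_set(frame))
    if any(a not in subsets for a in m):          # masses must sit on subsets of the frame
        return False
    if any(v < 0 or v > 1 for v in m.values()):   # each mass lies in [0, 1]
        return False
    if m.get(frozenset(), 0.0) != 0.0:            # m(empty set) = 0
        return False
    return abs(sum(m.values()) - 1.0) <= tol      # masses sum to 1

# Frame of discernment used later in the paper; the masses are illustrative only.
OMEGA = frozenset({"i", "M", "o"})
m = {frozenset({"M"}): 0.8, frozenset({"i", "o"}): 0.1, OMEGA: 0.1}

focal_elements = [a for a, v in m.items() if v > 0]
core = frozenset().union(*focal_elements)         # union of all focal elements
```

Here the mass on Ω itself expresses ignorance: it supports no single class over another.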

Definition 3

For a proposition A ⊆ Ω, the belief function Bel : $2^{\Omega}$ → [0,1] is defined as

$$Bel(A) = \sum_{B \subseteq A} m(B).$$

The plausibility function Pl : $2^{\Omega}$ → [0,1] is defined as

$$Pl(A) = 1 - Bel(\bar{A}) = \sum_{B \cap A \neq \emptyset} m(B),$$

where $\bar{A} = \Omega - A$. Obviously, Pl(A) ≥ Bel(A); the functions Bel and Pl are the lower limit function and upper limit function of proposition A, respectively. Consider two pieces of evidence indicated by two BPAs $m_1$ and $m_2$ on the frame of discernment Ω; Dempster's rule of combination is used to combine them. This rule assumes that these BPAs are independent.
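A short sketch of the belief and plausibility functions follows, assuming a BPA stored as a dictionary from `frozenset` propositions to masses. The function names and the example masses are illustrative assumptions.

```python
def bel(m, a):
    """Belief: total mass of the non-empty focal elements contained in `a`."""
    return sum(v for b, v in m.items() if b and b <= a)

def pl(m, a):
    """Plausibility: total mass of the focal elements that intersect `a`."""
    return sum(v for b, v in m.items() if b & a)

OMEGA = frozenset({"i", "M", "o"})
# Illustrative masses only (not taken from the paper).
m = {frozenset({"M"}): 0.6, frozenset({"M", "o"}): 0.3, OMEGA: 0.1}

A = frozenset({"M"})
belief = bel(m, A)          # only {M} itself lies inside A
plausibility = pl(m, A)     # every focal element here intersects A
```

For this example the interval [Bel(A), Pl(A)] is wide because most of the remaining mass sits on sets that contain “M” without committing to it.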

Definition 4

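Dempster's rule of combination, also called the orthogonal sum and denoted by $m = m_1 \oplus m_2$, is defined as follows:

$$m(A) = \begin{cases} \dfrac{1}{1 - K} \displaystyle\sum_{B \cap C = A} m_1(B)\, m_2(C), & A \neq \emptyset, \\ 0, & A = \emptyset, \end{cases}$$

with

$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C).$$

Note that Dempster's rule of combination is applicable only to two BPAs that satisfy the condition K < 1.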
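Dempster's rule can be sketched in a few lines of Python. This is a minimal illustration, assuming BPAs stored as dictionaries from `frozenset` propositions to masses; the function name, the tolerance used to detect total conflict, and the example masses are assumptions of this sketch.

```python
def dempster_combine(m1, m2):
    """Combine two BPAs (dict: frozenset -> mass) with Dempster's rule."""
    joint = {}
    conflict = 0.0                        # K: product mass falling on the empty set
    for b, v1 in m1.items():
        for c, v2 in m2.items():
            inter = b & c
            if inter:
                joint[inter] = joint.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    if 1.0 - conflict < 1e-12:            # K = 1: the rule is not applicable
        raise ValueError("total conflict between the two BPAs")
    return {a: v / (1.0 - conflict) for a, v in joint.items()}

OMEGA = frozenset({"i", "M", "o"})
# Illustrative masses only.
m1 = {frozenset({"M"}): 0.8, OMEGA: 0.2}
m2 = {frozenset({"M"}): 0.6, frozenset({"o"}): 0.3, OMEGA: 0.1}
m12 = dempster_combine(m1, m2)
```

In this example the conflicting mass is K = 0.8 × 0.3 = 0.24, so the remaining products are rescaled by 1/(1 − 0.24).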

3. Proposed Method

In this section, a new transmembrane protein topology prediction method is proposed based on evidential reasoning. For convenience, it is abbreviated as TOPPER (topology prediction of transmembrane protein based on evidential reasoning). TOPPER is built on the combination of multiple individual prediction algorithms. The process of obtaining the combined predictor is presented step by step as follows.

3.1. The Selection of Basic Predictor

Because the proposed topology prediction method combines multiple individual prediction methods, the basic predictors must be constructed first. Here, five individual prediction algorithms, OCTOPUS [3], PRO-TMHMM and PRODIV-TMHMM [38], SCAMPI-msa, and SCAMPI-seq [13], have been selected as basic predictors. In pattern recognition, the prediction performance of each predictor is expressed by a confusion matrix. In the topology prediction of transmembrane proteins, since there are only three classes, “i” (intracellular), “M” (transmembrane), and “o” (extracellular), the confusion matrix is formulated by

$$C^{\varphi} = \begin{pmatrix} n_{ii}^{\varphi} & n_{iM}^{\varphi} & n_{io}^{\varphi} \\ n_{Mi}^{\varphi} & n_{MM}^{\varphi} & n_{Mo}^{\varphi} \\ n_{oi}^{\varphi} & n_{oM}^{\varphi} & n_{oo}^{\varphi} \end{pmatrix},$$

where each item $n_{pq}^{\varphi}$ is the number of residues belonging to class p but predicted as class q by the basic predictor φ.
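The confusion matrix of a basic predictor can be tallied directly from per-residue label strings. The sketch below assumes that the observed and predicted topologies are given as equal-length strings over the three class labels; the function name and the two example strings are hypothetical.

```python
CLASSES = ("i", "M", "o")

def confusion_matrix(truth, predicted):
    """Count residues per (true class, predicted class) pair.

    `truth` and `predicted` are equal-length label strings over i/M/o.
    Entry n[p][q] counts residues of true class p predicted as class q.
    """
    if len(truth) != len(predicted):
        raise ValueError("label strings must have equal length")
    n = {p: {q: 0 for q in CLASSES} for p in CLASSES}
    for p, q in zip(truth, predicted):
        n[p][q] += 1
    return n

# Hypothetical 24-residue example with two helices.
truth     = "iiiiMMMMMMooooMMMMMMiiii"
predicted = "iiiMMMMMMMoooooMMMMMiiii"
n = confusion_matrix(truth, predicted)
```

Summing such matrices over all proteins in a training fold gives the per-predictor matrix used in the following steps.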

3.2. The Representation of the Basic Predictor's Prediction Results

In the combination of multiple predictors, the representation of each basic predictor's results is a critical problem. In this paper, BPAs are used to represent these prediction results, so the next question is how to construct them. For example, suppose a basic predictor assigns a residue in a protein sequence to a transmembrane helix (i.e., class “M”). Because the prediction is not 100% reliable, this uncertainty must be represented. Here, a classical and effective method proposed by Xu et al. [23] has been adopted to construct BPAs. In Xu et al.'s method, the output is treated as a single class label, and the source of evidence for the propositions of interest is defined from the performance of the predictor in terms of recognition, substitution, and rejection rates, which are derived from the confusion matrix. In short, it is a BPA construction method based on the confusion matrix. For a predictor φ of transmembrane protein topology with confusion matrix $C^{\varphi}$, a BPA $m_p^{\varphi}$ can be constructed for each class p according to Xu et al.'s method [23], where Ω = {i, M, o} (the defining equations are missing from this copy of the text). For a residue in a protein sequence, the constructed BPA is $m_i^{\varphi}$ if the prediction result shows that the residue belongs to class i. In the two other situations, M and o, the constructed BPAs are $m_M^{\varphi}$ and $m_o^{\varphi}$, respectively.
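The following sketch shows one way a confusion matrix can be turned into per-class BPAs. It is an assumption-laden illustration, not the paper's formula: the exact equations are not available here, so the sketch assumes a Xu-style construction in which, for a predicted class p, the recognition rate is the fraction of residues predicted as p that truly belong to p, the remainder is the substitution rate, and the rejection rate is zero because every residue receives a label. The matrix values are the OCTOPUS entries of Table 1.

```python
CLASSES = ("i", "M", "o")
OMEGA = frozenset(CLASSES)

# Confusion matrix of OCTOPUS from Table 1: rows = truth, columns = prediction.
OCTOPUS = {
    "i": {"i": 7655, "M": 389, "o": 839},
    "M": {"i": 1877, "M": 9785, "o": 1458},
    "o": {"i": 1230, "M": 578, "o": 6091},
}

def bpas_from_confusion(n):
    """Build one BPA per predicted class p (ASSUMED Xu-style rates)."""
    bpas = {}
    for p in CLASSES:
        predicted_as_p = sum(n[t][p] for t in CLASSES)   # column total
        recognition = n[p][p] / predicted_as_p           # prediction p was right
        bpas[p] = {
            frozenset({p}): recognition,                 # support for class p
            OMEGA - {p}: 1.0 - recognition,              # support for "not p"
        }
    return bpas

octopus_bpas = bpas_from_confusion(OCTOPUS)
m_M = octopus_bpas["M"]    # BPA used when OCTOPUS predicts class "M"
```

Under this assumption the mass on {M} equals the precision of class “M”, so a predictor that is rarely wrong when it outputs “M” contributes strong evidence for that class.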

3.3. The Combination of Multiple Predictors

Once the BPAs of every predictor have been constructed, the prediction results of multiple predictors can be combined. In this paper, the prediction results of the basic predictors are treated as pieces of evidence coming from different sources, and they are combined by using Dempster's rule of combination, as shown in Figure 2.

Figure 2

The combination of multiple predictors.

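Assume there are N basic predictors $\varphi_1, \varphi_2, \ldots, \varphi_N$ in the evidential prediction system. Let $S^{\varphi}$ be the set of BPAs constructed for all classes from basic predictor φ, $S^{\varphi} = \{m_i^{\varphi}, m_M^{\varphi}, m_o^{\varphi}\}$, and let $g(S^{\varphi})$ be the operation that returns the matched BPA for a residue predicted by φ. The combination of multiple predictors to predict the class of residue r can be expressed by

$$m_r = g(S^{\varphi_1}) \oplus g(S^{\varphi_2}) \oplus \cdots \oplus g(S^{\varphi_N}).$$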
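The combination step for a single residue can be sketched as a fold of Dempster's rule over the matched BPAs. In this sketch the lookup `bpa_sets[k][label]` plays the role of the operation g; the helper `simple_set`, the reliability values, and the three predicted labels are hypothetical and serve only to make the example runnable.

```python
from functools import reduce

OMEGA = frozenset("iMo")

def dempster_combine(m1, m2):
    """Dempster's rule for two BPAs stored as dict: frozenset -> mass."""
    joint, conflict = {}, 0.0
    for b, v1 in m1.items():
        for c, v2 in m2.items():
            inter = b & c
            if inter:
                joint[inter] = joint.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    if 1.0 - conflict < 1e-12:
        raise ValueError("total conflict between the two BPAs")
    return {a: v / (1.0 - conflict) for a, v in joint.items()}

def combine_residue(predicted_labels, bpa_sets):
    """Combine the matched BPAs of all predictors for one residue."""
    matched = [s[label] for label, s in zip(predicted_labels, bpa_sets)]
    return reduce(dempster_combine, matched)

def simple_set(rate):
    """Hypothetical BPA set: same reliability `rate` for every predicted class."""
    return {p: {frozenset(p): rate, OMEGA - {p}: 1.0 - rate} for p in "iMo"}

bpa_sets = [simple_set(0.9), simple_set(0.8), simple_set(0.7)]
m_r = combine_residue(("M", "M", "o"), bpa_sets)   # two predictors say M, one says o
best = max(m_r, key=m_r.get)
```

Because Dempster's rule is associative, the order in which the predictors are folded does not change the combined BPA.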

3.4. The Determination of Topology

Through the above steps, the combined prediction result is derived for each residue in a transmembrane protein sequence. It is indicated by a BPA $m_r$. To obtain the final class of the residue, the BPA is translated into a probability distribution by using the pignistic probability transformation (PPT) function, proposed by Smets and Kennes in the transferable belief model (TBM) [39]. The PPT function [39] is defined as follows. Let m be a BPA on a frame of discernment Ω; the pignistic probability transformation function $BetP_m$ : Ω → [0,1] corresponding to m is

$$BetP_m(\omega) = \sum_{A \subseteq \Omega,\ \omega \in A} \frac{1}{|A|} \frac{m(A)}{1 - m(\emptyset)},$$

where |A| is the cardinality of proposition A. By using the PPT function, the BPA $m_r$ can be translated into a probability distribution $p_r$. The class of residue r is then the class with the maximum value in $p_r$. Finally, the topology of a transmembrane protein is determined once the classes of all residues in the protein sequence have been determined. For each protein, the transmembrane orientation is determined by the location of the first residue, and each transmembrane region consists of a run of residues labelled as class “M” whose length exceeds a threshold. From this topology, all transmembrane helices and the orientation of each helix can be derived.
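The decision step can be sketched as follows: the pignistic transformation spreads each mass evenly over the elements of its proposition, the class with the largest pignistic probability is chosen, and runs of “M” labels longer than a threshold are reported as transmembrane regions. The threshold value is not stated in the text, so `min_length` is a free parameter of this sketch, and the example BPA and label string are hypothetical.

```python
from itertools import groupby

def pignistic(m, frame):
    """BetP(w): sum of m(A) / |A| over propositions A containing w."""
    empty_mass = m.get(frozenset(), 0.0)
    return {w: sum(v / len(a) for a, v in m.items() if w in a) / (1.0 - empty_mass)
            for w in frame}

def decide(m, frame):
    """Return the class with the maximum pignistic probability."""
    betp = pignistic(m, frame)
    return max(betp, key=betp.get)

def tm_regions(labels, min_length):
    """Half-open (start, end) spans of 'M' runs longer than `min_length`."""
    regions, pos = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == "M" and length > min_length:
            regions.append((pos, pos + length))
        pos += length
    return regions

OMEGA = frozenset({"i", "M", "o"})
m_r = {frozenset({"M"}): 0.5, frozenset({"M", "o"}): 0.3, OMEGA: 0.2}
label = decide(m_r, OMEGA)

labels = "iiiMMMMMMoooMMi"                       # per-residue decisions
regions = tm_regions(labels, min_length=3)       # threshold value is an assumption
orientation = labels[0]                          # location of the first residue
```

In the example the second run of “M” has only two residues, so it is not reported as a transmembrane region.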

4. Experimental Verification

In this paper, a data set of 125 transmembrane protein sequences with known topology is collected from MPtopo [40] to verify the effectiveness of the proposed method TOPPER. To reflect the performance of the combined predictor faithfully and to avoid overfitting, the experiment is performed using tenfold cross-validation. Each fold contains roughly 12-13 transmembrane proteins, and sequence homology has been reduced to below 30% by using the cd-hit program [41]. To assess how well the different algorithms predict transmembrane regions (i.e., transmembrane helices without considering orientation), the evaluation method developed by Tusnády and Simon [11] is adopted. The prediction of a transmembrane region is considered successful when the overlap between the predicted and observed transmembrane regions contains at least 9 amino acids. The total numbers of predicted and observed transmembrane regions are indicated by $N_{prd}$ and $N_{obs}$, respectively, and the number of overlapping predicted and observed transmembrane regions is indicated by $N_{cor}$. The efficiency of transmembrane region prediction is measured by $M = N_{cor}/N_{obs}$ and $C = N_{cor}/N_{prd}$. The overall prediction power is defined by

$$Q = \sqrt{M \times C}.$$

Besides, if all transmembrane regions and the orientation of a transmembrane protein sequence have been predicted correctly, the topology of the transmembrane protein is said to be predicted correctly.

In the rest of this section, the prediction algorithms are compared at three levels: the residue level, the transmembrane region level, and the topology level. At the residue level, the confusion matrix of each algorithm is shown in Table 1. Based on these confusion matrices, Table 2 lists several indexes of residue prediction performance, including the recall, precision, and F score of each class and the overall prediction accuracy of residues. The residue prediction accuracy of TOPPER is 80.00%, while the accuracies of OCTOPUS, PRO, PRODIV, SCAMPI-msa, and SCAMPI-seq are 78.69%, 77.91%, 77.63%, 78.69%, and 77.66%, respectively. The proposed method has the highest residue prediction accuracy, as shown in Figure 3. In addition, TOPPER has the highest F score for each of the classes “i”, “M”, and “o”, as shown in Figure 4. Hence, TOPPER outperforms the other algorithms at the residue level.
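The residue-level indexes follow directly from a confusion matrix. The sketch below recomputes them for TOPPER from its Table 1 entries, assuming the usual definitions (recall over the true-class row, precision over the predicted-class column, F score as their harmonic mean); the function name is hypothetical.

```python
CLASSES = ("i", "M", "o")

# TOPPER confusion matrix from Table 1: rows = truth, columns = prediction.
TOPPER = {
    "i": {"i": 7636, "M": 358, "o": 889},
    "M": {"i": 1799, "M": 9817, "o": 1504},
    "o": {"i": 916, "M": 518, "o": 6465},
}

def residue_metrics(n):
    """Return per-class (recall, precision, F score) and overall accuracy."""
    per_class = {}
    for p in CLASSES:
        recall = n[p][p] / sum(n[p][q] for q in CLASSES)      # row total
        precision = n[p][p] / sum(n[t][p] for t in CLASSES)   # column total
        f_score = 2 * recall * precision / (recall + precision)
        per_class[p] = (recall, precision, f_score)
    total = sum(n[p][q] for p in CLASSES for q in CLASSES)
    accuracy = sum(n[p][p] for p in CLASSES) / total
    return per_class, accuracy

per_class, accuracy = residue_metrics(TOPPER)
for p in CLASSES:
    recall, precision, f_score = per_class[p]
    print(f"{p}: recall={100 * recall:.2f}% precision={100 * precision:.2f}% F={f_score:.4f}")
print(f"accuracy={100 * accuracy:.2f}%")
```

The values obtained this way agree with the TOPPER rows of Table 2 to within rounding.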

Table 1

Confusion matrices of residue prediction for various algorithms.

Truth	Algorithm	Predicted i	Predicted M	Predicted o
i	OCTOPUS	7655	389	839
i	PRO	7574	450	859
i	PRODIV	7323	442	1118
i	SCAMPI-msa	7655	389	839
i	SCAMPI-seq	7359	455	1069
i	TOPPER	7636	358	889
M	OCTOPUS	1877	9785	1458
M	PRO	1922	9588	1610
M	PRODIV	1819	9884	1417
M	SCAMPI-msa	1877	9785	1458
M	SCAMPI-seq	1907	9628	1585
M	TOPPER	1799	9817	1504
o	OCTOPUS	1230	578	6091
o	PRO	1051	714	6134
o	PRODIV	1117	775	6007
o	SCAMPI-msa	1230	578	6091
o	SCAMPI-seq	1101	564	6234
o	TOPPER	916	518	6465

Table 2

Prediction performance of various algorithms in residue level.

Algorithm	Class	Recall (%)	Precision (%)	F score	Overall accuracy (%)
OCTOPUS	i	86.18	71.13	0.7793	78.69
OCTOPUS	M	74.58	91.01	0.8198	78.69
OCTOPUS	o	77.11	72.62	0.7480	78.69
PRO	i	85.26	71.81	0.7796	77.91
PRO	M	73.08	89.17	0.8033	77.91
PRO	o	77.66	71.30	0.7434	77.91
PRODIV	i	82.44	71.38	0.7651	77.63
PRODIV	M	75.34	89.04	0.8162	77.63
PRODIV	o	76.05	70.32	0.7307	77.63
SCAMPI-msa	i	86.18	71.13	0.7793	78.69
SCAMPI-msa	M	74.58	91.01	0.8198	78.69
SCAMPI-msa	o	77.11	72.62	0.7480	78.69
SCAMPI-seq	i	82.84	70.98	0.7646	77.66
SCAMPI-seq	M	73.38	90.43	0.8102	77.66
SCAMPI-seq	o	78.92	70.14	0.7427	77.66
TOPPER	i	85.96	73.77	0.7940	80.00
TOPPER	M	74.82	91.81	0.8245	80.00
TOPPER	o	81.85	72.98	0.7716	80.00

Figure 3

The comparison of residue's prediction accuracy between the proposed method and other algorithms.

Figure 4

The comparison of F score between the proposed method and other algorithms.

At the transmembrane region level, Table 3 shows the prediction performance of the various algorithms. According to the overall prediction power defined in [11], the Q value of TOPPER is 97.85%, while the Q values of OCTOPUS, PRO, PRODIV, SCAMPI-msa, and SCAMPI-seq are 97.37%, 96.98%, 96.83%, 97.37%, and 96.68%, respectively. The Q value of TOPPER is the highest, as shown in Figure 5, so TOPPER is superior to the other algorithms at this level.
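The region-level evaluation can be sketched as follows. Two points are assumptions of this sketch rather than statements from the paper: regions are matched greedily one-to-one, and Q is taken as the geometric mean of M and C, which reproduces the Q column of Table 3. The example spans are hypothetical, while the counts 515, 507, and 500 are the TOPPER entries of Table 3.

```python
from math import sqrt

def overlap(a, b):
    """Number of shared positions between half-open spans a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def count_correct(observed, predicted, min_overlap=9):
    """Count observed regions matched by a distinct predicted region."""
    used, correct = set(), 0
    for obs in observed:
        for k, prd in enumerate(predicted):
            if k not in used and overlap(obs, prd) >= min_overlap:
                used.add(k)          # each predicted region is matched at most once
                correct += 1
                break
    return correct

def region_scores(n_obs, n_prd, n_cor):
    """Return (M, C, Q) as fractions."""
    m = n_cor / n_obs
    c = n_cor / n_prd
    return m, c, sqrt(m * c)

# Hypothetical protein: the second predicted helix overlaps by only 5 residues.
n_cor_example = count_correct([(10, 30), (50, 70)], [(12, 31), (65, 85)])

# TOPPER counts from Table 3.
m, c, q = region_scores(n_obs=515, n_prd=507, n_cor=500)
print(f"M={100 * m:.2f}% C={100 * c:.2f}% Q={100 * q:.2f}%")
```

Because Q combines M and C, an algorithm cannot raise it simply by predicting more regions: extra predictions increase M at the cost of C.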

Table 3

Prediction performance of various algorithms in transmembrane region level.

Algorithm	N_obs	N_prd	N_cor	M (%)	C (%)	Q (%)
OCTOPUS	515	512	500	97.09	97.66	97.37
PRO	515	512	498	96.70	97.27	96.98
PRODIV	515	524	503	97.67	95.99	96.83
SCAMPI-msa	515	512	500	97.09	97.66	97.37
SCAMPI-seq	515	507	494	95.92	97.44	96.68
TOPPER	515	507	500	97.09	98.62	97.85

Figure 5

The comparison of transmembrane region's prediction performance between the proposed method and other algorithms.

At the topology level, Table 4 shows the topology prediction accuracy of each algorithm. The topology prediction accuracy of TOPPER is 74.4%, which is the highest among these algorithms, as shown in Figure 6. Therefore, the proposed TOPPER is superior to the other algorithms at this level as well.

Table 4

Prediction performance of various algorithms in topology level.

Algorithm	Prediction accuracy of topology (%)
OCTOPUS	71.2
PRO	70.4
PRODIV	69.6
SCAMPI-msa	71.2
SCAMPI-seq	69.6
TOPPER	74.4

Figure 6

The comparison of topology's prediction accuracy between the proposed method and other algorithms.

In summary, the proposed TOPPER outperforms the other algorithms at all three levels: residue prediction, transmembrane region prediction, and topology prediction. Hence, the effectiveness of the proposed method has been demonstrated.

5. Conclusions

Transmembrane proteins are a special and important class of proteins in cells, and the prediction of their topology is a foundation for research on them. In this paper, a new topology prediction method for transmembrane proteins is proposed based on evidential reasoning. The proposed method combines multiple individual prediction algorithms, using the Dempster-Shafer theory to represent and combine the results of the basic predictors. Experimental results show that the proposed method is superior to the individual prediction algorithms, which demonstrates its effectiveness.

27 in total

1. A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm.

Authors: C Pasquier; V J Promponas; G A Palaios; J S Hamodrakas; S J Hamodrakas
Journal: Protein Eng Date: 1999-05

2. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

3. MPtopo: A database of membrane protein topology.

Authors: S Jayasinghe; K Hristova; S H White
Journal: Protein Sci Date: 2001-02 Impact factor: 6.725

4. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information.

Authors: Håkan Viklund; Arne Elofsson
Journal: Protein Sci Date: 2004-07 Impact factor: 6.725

5. Scoring hidden Markov models to discriminate beta-barrel membrane proteins.

6. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

7. A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins.

8. Prediction of membrane-protein topology from first principles.

9. A simple method for displaying the hydropathic character of a protein.

10. Applying fuzzy logic to comparative distribution modelling: a case study with two sympatric amphibians.
