
Some remarks on protein attribute prediction and pseudo amino acid composition.

Abstract

With the completion of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. This imbalance, which has critically limited our ability to use newly discovered proteins in a timely manner for basic research and drug development, calls for computational methods or high-throughput automated tools that can rapidly and reliably identify various attributes of uncharacterized proteins based on their sequence information alone. Indeed, during the last two decades or so, many methods of this kind have been established in the hope of bridging the gap. In developing these methods, the following issues often needed to be considered: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly on how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.

Substances:
Amino Acids
Proteins

Year: 2010 PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024

Source DB: PubMed Journal: J Theor Biol ISSN: 0022-5193 Impact factor: 2.691

Introduction

With the explosive growth of protein sequences generated in the postgenomic age, scientists are anxious to know their attributes because they are closely correlated with the structures and functions of the proteins as well as their roles in biological processes, and hence are very important to both basic research and drug target development. For instance, given an uncharacterized protein sequence, what is its folding rate? Which structural class and quaternary structural attribute does it belong to? In which subcellular location does it reside? Can it simultaneously exist in or move between two or more subcellular locations? How can we identify it as an enzyme or non-enzyme? If it is an enzyme, to which enzyme functional class does it belong? Is it a membrane protein or non-membrane protein? If the former, to which membrane protein type does it belong? Is it a protease? If it is, to which protease type does it belong? Is it a G protein-coupled receptor (GPCR)? If it is, to which GPCR type does it belong? Which part of the protein serves as its signal sequence? Where are its cleavage sites by proteases such as HIV (human immunodeficiency virus) protease and SARS (severe acute respiratory syndrome) enzyme? And so forth. Although the answers to these questions can be determined by conducting various biochemical experiments, relying on experimental approaches alone is both time-consuming and costly. As a consequence, the gap between the number of newly discovered protein sequences and the knowledge of their attributes continues to expand. To bridge such a gap and acquire these kinds of information in a timely manner, scientists are challenged to develop computational methods for predicting various attributes of proteins based on their sequence information alone.
To establish a really useful predictor in this regard, one usually needs to accomplish the following procedures: (1) construct a valid benchmark dataset to train and test the predictor; (2) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the accuracy of the predictor; and (5) establish a user-friendly web-server for the predictor that is accessible to the public. This review will discuss each of the above five procedures, with a special focus on procedure 2, particularly on how to use various different modes of pseudo amino acid composition to represent protein samples by incorporating their core and essential features.

Benchmark dataset

To develop a statistical prediction method for a given attribute, the first important thing is to construct a benchmark dataset according to its possible classification, i.e.

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \cdots \cup \mathbb{S}_M \tag{1}$$

where $\mathbb{S}_1$ represents the subset for category 1 of the attribute, $\mathbb{S}_2$ for category 2, and so forth; while $\cup$ represents the symbol for “union” in the set theory, and $M$ the number of different categories for the attribute concerned. For example, when the attribute concerned was about the protein structural classification as investigated in Chou (1995a), Chou and Zhang (1994), Chou (1989), Levitt and Chothia (1976), Nakashima et al. (1986) and Zhou (1998), M would be four as illustrated in Fig. 1; when the structural classification was defined according to the SCOP database (Murzin et al., 1995) or investigated in Chou and Cai (2004b), M would be seven as shown in Fig. 2; when the attribute was about the membrane protein type as investigated in Chou and Shen (2007d), M would be eight (Chou and Shen, 2007d) as illustrated in Fig. 3; when the attribute was about the subcellular localization of eukaryotic proteins as investigated in Chou and Shen (2010a), M would be 22 as illustrated in Fig. 4.

Fig. 1

Illustration to show the four categories of protein structural class: (a) all-α, (b) all-β, (c) α/β, and (d) α+β, where the α-helix is colored in red, β-strand in yellow, and the other in green. The PDB codes used to draw the representatives of the four structural classes are 1aep, 1gbg, 1enp, and 1aak, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2

Illustration to show the seven categories of protein structural class: (a) all-α, (b) all-β, (c) α/β, (d) α+β, (e) μ (multi-domain), (f) σ (small protein), and (g) ρ (peptide), where the α-helix is colored in red, β-strand in yellow, and the other in green. The PDB codes used to draw the representatives of the seven structural classes are 1a6m, 1uzv, 2f62, 2bf5, 1vqq, 4hir, and 1ter, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3

Schematic drawings to show the eight categories of membrane protein types: (1) type I transmembrane, (2) type II, (3) type III, (4) type IV, (5) multipass transmembrane, (6) lipid-chain-anchored membrane, (7) GPI-anchored membrane, and (8) peripheral membrane. As shown in the figure, types I, II, III, and IV are all single-pass transmembrane proteins; see Spiess (1995) for a detailed description of their differences.

Fig. 4

Schematic illustration to show the 22 subcellular locations of eukaryotic proteins: (1) acrosome, (2) cell wall, (3) centriole, (4) chloroplast, (5) cyanelle, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracellular, (11) Golgi apparatus, (12) hydrogenosome, (13) lysosome, (14) melanosome, (15) microsome, (16) mitochondria, (17) nucleus, (18) peroxisome, (19) plasma membrane, (20) plastid, (21) spindle pole body, and (22) vacuole.

To avoid homology bias and redundancy, it is important to introduce a cutoff threshold when constructing a benchmark dataset. Different cutoff threshold values have been used, such as 90% (Reinhardt and Hubbard, 1998), 80% (Small et al., 2004), 40% (Shen and Chou, 2007a), and 25% (Chou and Shen, 2010a, Chou and Shen, 2010c). When a benchmark dataset was constructed with the cutoff threshold of 25%, none of the proteins included would have ≥25% pairwise sequence identity to any other in the same subset (category). Accordingly, the smaller the cutoff threshold is, the more stringent the benchmark dataset will be in excluding homology bias. The benchmark datasets constructed in the earlier stage (see, e.g., Cedano et al., 1997, Chou, 1989, Nakashima et al., 1986) usually consisted of a learning (or training) dataset and an independent testing dataset, as can be formulated as

$$\mathbb{S} = \mathbb{S}^{L} \cup \mathbb{S}^{T}, \qquad \mathbb{S}^{L} \cap \mathbb{S}^{T} = \emptyset \tag{2}$$

where $\mathbb{S}^{L}$ is the learning dataset, $\mathbb{S}^{T}$ the testing dataset, $\emptyset$ the empty set, and $\cap$ the symbol for “intersection” in the set theory. The learning dataset is used for training the predictor's “engine”, while the testing dataset is used for evaluating the predictor's accuracy via cross-validation. As we can see from Eq. (2), none of the proteins in the testing dataset $\mathbb{S}^{T}$ should occur in the learning dataset $\mathbb{S}^{L}$. Therefore, $\mathbb{S}^{T}$ is also called an independent dataset for performing cross-validation. However, as will be shown later, there is no need to artificially separate the benchmark dataset into a learning dataset and a testing dataset when the cross-validation is performed by the jackknife test, in which case one benchmark dataset can serve both the training and testing purposes.
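The cutoff idea and the disjointness condition of Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `identity` is a crude, alignment-free stand-in for pairwise sequence identity (real benchmark construction relies on proper alignment-based tools), and the greedy strategy in `filter_subset` is one simple way to enforce a cutoff, not necessarily the procedure used in the cited studies. All function names are hypothetical.

```python
def identity(a, b):
    """Crude stand-in for pairwise sequence identity (an assumption):
    fraction of matching positions over the shorter sequence, no alignment."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def filter_subset(seqs, cutoff=0.25):
    """Greedily keep each sequence whose identity to every sequence
    already kept is below the cutoff (applied within one subset)."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < cutoff for k in kept):
            kept.append(s)
    return kept


def is_valid_split(learning, testing):
    """Eq. (2): the learning and testing datasets must share no protein."""
    return set(learning).isdisjoint(testing)
```

With a 25% cutoff, `filter_subset(["AAAA", "AAAT", "CCCC"])` drops the second sequence because it is 75% identical to the first under this toy measure.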

Protein sample representation

Two kinds of models were usually used to represent protein samples. One is the sequential model, and the other the discrete model. The most straightforward sequential model for a protein sample is its entire amino acid sequence, as expressed by

$$\mathbf{P} = R_1 R_2 R_3 R_4 \cdots R_L \tag{3}$$

where $R_1$ represents the 1st residue of the protein $\mathbf{P}$, $R_2$ the 2nd residue, …, $R_L$ the $L$-th residue, and they each belong to one of the 20 native amino acid types. To get the desired results, the sequence-similarity-search-based tools, such as BLAST (Altschul, 1997, Wootton and Federhen, 1993), are usually utilized to conduct the prediction. However, this kind of approach fails when the query protein does not have significant sequence similarity to any attribute-known proteins. Thus, various non-sequential models, or discrete models, were proposed, as illustrated below. The simplest discrete model used to represent a protein sample is its amino acid (AA) composition or AAC (Nakashima et al., 1986). According to the AAC-discrete model, the protein $\mathbf{P}$ of Eq. (3) can be expressed by (Chou, 1995a)

$$\mathbf{P} = [f_1 \;\; f_2 \;\; \cdots \;\; f_{20}]^{\mathrm{T}} \tag{4}$$

where $f_i$ ($i=1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in $\mathbf{P}$, and $\mathrm{T}$ the transposing operator. Many methods for predicting various protein attributes were based on the AAC-discrete model (see, e.g., Cedano et al., 1997, Chou, 1999, Chou, 2000, Chou, 2005b, Chou and Zhang, 1992, Chou and Zhang, 1995, Chou and Maggiora, 1998, Chou and Elrod, 1999, Chou and Elrod, 2002, Chou et al., 1998, Chou, 1989, Du et al., 2006, Feng et al., 2005, Jahandideh et al., 2007a, Klein, 1986, Klein and Delisi, 1986, Liu and Chou, 1998, Metfessel et al., 1993, Nakashima and Nishikawa, 1994, Niu et al., 2006, Zhou, 1998, Zhou and Assa-Munt, 2001, Zhou and Doctor, 2003). However, as one can see from Eq. (4), all the sequence-order effects would be missing using the AAC-discrete model, and hence the prediction quality thus obtained might be limited. This is the main shortcoming of the AAC-discrete model.
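As a minimal sketch of the AAC-discrete model of Eq. (4), the Python function below counts the 20 native amino acids and normalizes by the total. The names `AMINO_ACIDS` and `aac` are illustrative only, and the sketch assumes residues outside the 20 native types are simply ignored.

```python
# The 20 native amino acids in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20 normalized occurrence frequencies f_1..f_20 of Eq. (4)."""
    counts = [sequence.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # residues outside the 20 native types are ignored
    if total == 0:
        raise ValueError("sequence contains no native amino acids")
    return [c / total for c in counts]
```

Because only counts enter the vector, `aac("ACD")` and `aac("DCA")` are identical, which is exactly the loss of sequence-order information noted above.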
To avoid completely losing the sequence-order information, a completely different discrete model, or the so-called “pseudo amino acid composition” (PseAAC) model (Chou, 2001), was proposed to represent the sample of a protein, as formulated by

$$\mathbf{P} = [p_1, \; p_2, \; \ldots, \; p_{20}, \; p_{20+1}, \; \ldots, \; p_{20+\lambda}]^{\mathrm{T}} \tag{5}$$

where the first 20 elements are associated with the 20 elements in Eq. (4) or the 20 amino acid components of the protein, while the additional λ factors are used to incorporate some sequence-order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combinations of other factors so long as they can reflect some sorts of sequence-order effects in one way or the other. For the convenience of users, a web-server called “PseAAC” (Shen and Chou, 2008) was established at http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/, by which some commonly used PseAAC forms can be automatically generated. The concept of PseAAC has been widely used to study various problems in proteins and protein-related systems, such as predicting enzymes and their family/sub-family classification (Cai and Chou, 2005, Cai et al., 2005, Qiu et al., 2010, Wang et al., 2010b, Zhou et al., 2007), protein subcellular location prediction (Cai and Chou, 2003, Chou and Cai, 2003c, Chou and Cai, 2004e, Gao et al., 2005, Li and Li, 2008b, Pan et al., 2003, Shi et al., 2007, Shi et al., 2008, Xiao et al., 2006b, Zhang et al., 2008c), apoptosis protein subcellular location prediction (Chen and Li, 2007, Jiang et al., 2008b, Kandaswamy et al., 2010, Lin et al., 2009a, Liu et al., 2010b), mycobacterial protein subcellular location prediction (Lin et al., 2008), predicting protein subnuclear localization (Jiang et al., 2008a, Li and Li, 2008a, Shen and Chou, 2005b), predicting protein subchloroplast locations (Du et al., 2009), predicting protein submitochondria locations (Du and Li, 2006, Nanni and Lumini, 2008, Zeng et al., 2009), predicting membrane proteins and their types (Cai and Chou, 2006, Chou and Shen, 2007d, Liu et al., 2005, Shen and Chou, 2005a, Shen et al., 2006, Wang et al., 2004, Wang et al., 2006), discrimination of outer membrane proteins (Gao et al., 2010, Lin, 2008), identifying transmembrane regions in proteins (Diao et al., 2008), identifying proteases and their types (Chou and Shen, 2008a, Zhou and Cai, 2006), predicting protein solubility (Xiaohui et al., 2010), identifying GPCRs and their classes (Gu et al., 2010a, Gu et al., 2010b, Lin et al., 2009b, Qiu et al., 2009, Xiao et al., 2009b, Xiao et al., 2010b), prediction of nuclear receptors (Gao et al., 2009), prediction of cyclin proteins (Mohabatkar, 2010), identifying bacterial secreted proteins (Yu et al., 2010), identifying risk type of human papillomaviruses (Esmaeili et al., 2010), prediction of cell wall lytic enzymes (Ding et al., 2009), prediction of lipase types (Zhang et al., 2008a), predicting conotoxin superfamily and family (Lin and Li, 2007a, Mondal et al., 2006), predicting the cofactors of oxidoreductases (Zhang and Fang, 2008), predicting DNA-binding proteins (Fang et al., 2008), predicting protein structural classes (Chen et al., 2006a, Chen et al., 2006b, Ding et al., 2007, Li et al., 2009, Lin and Li, 2007b, Wu et al., 2010, Xiao et al., 2008a, Xiao et al., 2008b, Xiao et al., 2006a, Zhang and Ding, 2007, Zhang et al., 2008d), supersecondary structure prediction (Zou et al., 2011), protein secondary structure content prediction (Chen et al., 2009), predicting protein quaternary structural attributes (Chou and Cai, 2003a, Shen and Chou, 2009b, Xiao et al., 2009a, Xiao et al., 2010a, Zhang et al., 2008b, Zhang et al., 2006), fold pattern prediction (Shen and Chou, 2006, Shen and Chou, 2009a), and others (e.g., Georgiou et al., 2009).
Meanwhile, various modes of PseAAC by extracting different features from protein sequences were proposed, including stochastic signal processing mode (Pan et al., 2003), Fourier spectrum analysis mode (Liu et al., 2005), special functions mode (Gao et al., 2005), complexity measure factor mode (Xiao et al., 2005, Xiao et al., 2006a), cellular automaton mode (Xiao et al., 2006b, Xiao et al., 2008b, Xiao et al., 2009b), geometric moments mode (Xiao et al., 2008b), gray dynamic mode (Xiao et al., 2008a), approximate entropy mode (Jiang et al., 2008a), continuous wavelet transform mode (Li et al., 2009), discrete wavelet transform mode (Qiu et al., 2009, Qiu et al., 2010), sequence-segmented mode (Zhang et al., 2008b), evolutionary information and von Neumann entropy mode (Zhang et al., 2008c), and so forth. However, according to its original concept, the essence of PseAAC is to keep using a discrete model to represent a protein yet without completely losing its sequence-order information. Therefore, in a broad sense, the PseAAC of a protein is actually a set of discrete numbers that is derived from its amino acid sequence and that is different from the classical AAC and able to harbor some sort of sequence order or pattern information. Accordingly, the PseAAC for a protein $\mathbf{P}$ should be generally formulated as

$$\mathbf{P} = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega]^{\mathrm{T}} \tag{6}$$

where the subscript $\Omega$ is an integer, and its value and the components $\psi_1, \psi_2, \ldots$ will depend on how to extract the desired information from the amino acid sequence of $\mathbf{P}$ (cf. Eq. (3)). The form of Eq. (6) can cover all the aforementioned modes of PseAAC. For example, when

$$\Omega = 20+\lambda, \qquad \psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{k=1}^{\lambda}\theta_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{k=1}^{\lambda}\theta_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{7}$$

we immediately obtain the formulation of PseAAC originally introduced in Chou (2001), where the meanings of $w$, $\theta_k$, and $\lambda$ were clearly elaborated and hence need not be repeated here. When

$$\Omega = 20+2\lambda, \qquad \psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{k=1}^{2\lambda}\tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{k=1}^{2\lambda}\tau_k}, & (20+1 \le u \le 20+2\lambda) \end{cases} \tag{8}$$

we obtain the formulation for the amphiphilic PseAAC (Chou, 2005a), where the meanings of $w$, $\tau_k$, and $\lambda$ were also clearly given. It is instructive to point out that, with the general formulation of Eq. (6), the PseAAC can be used to reflect much more essential core features deeply hidden in complicated protein sequences through the following modes.
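To make the (20+λ)-dimensional form concrete, here is a hedged Python sketch in the spirit of the original PseAAC of Chou (2001). Several choices are assumptions made for illustration rather than details taken from this review: a single physicochemical property (the Kyte-Doolittle hydropathy scale, standardized over the 20 amino acids) replaces the several properties averaged in the original work, the correlation factor θ_k is taken as the mean squared property difference between residues k positions apart, and the default weight `w=0.05` is only an example value. The function and variable names are hypothetical.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as the only property (assumption).
KD = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(scale):
    """Shift and rescale a property to zero mean and unit standard deviation."""
    mean = sum(scale.values()) / len(scale)
    sd = sqrt(sum((v - mean) ** 2 for v in scale.values()) / len(scale))
    return {a: (v - mean) / sd for a, v in scale.items()}


H = standardize(KD)


def pseaac(seq, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional vector: 20 composition terms followed
    by lam weighted sequence-order correlation terms.
    Assumes seq contains only the 20 native one-letter codes."""
    L = len(seq)
    if not 0 <= lam < L:
        raise ValueError("lam must satisfy 0 <= lam < len(seq)")
    freqs = [seq.count(a) / L for a in AMINO_ACIDS]
    # theta_k: mean squared property difference between residues k apart.
    thetas = [
        sum((H[seq[i]] - H[seq[i + k]]) ** 2 for i in range(L - k)) / (L - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

With `lam=0` the vector reduces to the plain amino acid composition, and for any valid `lam` the components sum to one because of the shared denominator.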

Functional domain mode

The functional domain (FunD) is the core of a protein. Therefore, in determining the three-dimensional (3-D) structure of a protein by experiments (see, e.g., Call et al., 2010, Pielak and Chou, 2010, Schnell and Chou, 2008, Wang et al., 2009) or by computational modeling (see, e.g., Chou, 2004a, Chou, 2004b), the first priority was always focused on its FunD. Using the FunD information to formulate protein samples was originally proposed in Cai et al. (2003) and Chou and Cai (2002) based on the 2005 FunDs in the SBASE-A database (Murvai et al., 2001). Since then, a series of new protein FunD databases were established, such as COG (Tatusov et al., 2003), KOG (Tatusov et al., 2003), SMART (Letunic et al., 2006), Pfam (Finn et al., 2006), and CDD (Marchler-Bauer et al., 2007). Of these databases, CDD contains the domains imported from COG, Pfam, and SMART, and hence is relatively much more complete (Marchler-Bauer et al., 2007) and was adopted in most of the recent publications (see, e.g., Chou and Shen, 2010a, Chou and Shen, 2010c, Shen and Chou, 2009d). Version 2.11 of CDD contains 17,402 characteristic domains. Thus, when using the general formulation of PseAAC (Eq. (6)) to incorporate the FunD information, we have Ω=17,402, i.e.

$$\mathbf{P} = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_{17402}]^{\mathrm{T}} \tag{9}$$

where $\mathrm{T}$ has the same meaning as in Eq. (4), and

$$\psi_u = \begin{cases} 1, & \text{when a hit is found for } \mathbf{P} \text{ in the } u\text{-th domain of CDD} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

For the detailed procedure of how to find the hit for $\mathbf{P}$ in CDD, refer to Chou and Shen (2010a). Similar approaches of representing protein samples with the FunD mode were also used for predicting protein subcellular localization (Chou and Cai, 2002, Chou and Cai, 2004d), membrane protein types (Cai and Chou, 2006, Cai et al., 2003), enzyme functional classes (Shen and Chou, 2007a), protease types (Chou and Shen, 2008a, Shen and Chou, 2009c), GPCR types (Xiao et al., 2010b), protein structural class (Chou and Cai, 2004b), protein fold pattern (Shen and Chou, 2009a), and protein quaternary structural attributes (Shen and Chou, 2009b, Xiao et al., 2009a, Xiao et al., 2010a).
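A binary hit vector of this kind is straightforward to build once the hits are known. The Python sketch below assumes the database search has already been done elsewhere and has produced a collection of 1-based indices of the entries that were hit; `binary_profile` and `is_naught` are hypothetical helper names, and the search step itself is not shown.

```python
def binary_profile(hit_indices, dimension):
    """Build the 0/1 vector: component u is 1 if entry u (1-based) was hit."""
    vec = [0] * dimension
    for u in hit_indices:
        if not 1 <= u <= dimension:
            raise ValueError("hit index out of range: %r" % (u,))
        vec[u - 1] = 1
    return vec


def is_naught(vec):
    """True when no hit was found, i.e. every component is zero."""
    return not any(vec)
```

For the CDD-based representation described above one would call `binary_profile(hits, 17402)`; a protein without any hit yields a naught vector, for which this representation carries no information.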

Gene ontology mode

The gene ontology (GO) database (Ashburner et al., 2000) was established according to the molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way better reflecting some of their important attributes, such as subcellular localization and biological function (Chou and Shen, 2007c, Chou and Shen, 2008b). The GO database (version 70.0 released 10 March 2008) contains 60,020 GO numbers. Thus, when using the general formulation of PseAAC to incorporate the GO information, we have Ω=60,020, i.e.

$$\mathbf{P} = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_{60020}]^{\mathrm{T}} \tag{11}$$

where

$$\psi_u = \begin{cases} 1, & \text{when a hit is found for } \mathbf{P} \text{ against the } u\text{-th GO number} \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

For the detailed procedure of how to find the hit for $\mathbf{P}$ in the GO database, refer to Chou and Shen (2010a). The information extracted from the GO database (Ashburner et al., 2000, Camon et al., 2004, Harris et al., 2004) was used to formulate PseAAC for predicting protein subcellular localization (Cai and Chou, 2003, Chou and Cai, 2003b, Chou and Cai, 2004d, Chou and Shen, 2006a, Chou and Shen, 2006b, Chou and Shen, 2006c, Chou and Shen, 2007a, Chou and Shen, 2007b, Chou and Shen, 2007c, Chou and Shen, 2008b, Lee et al., 2005, Shen and Chou, 2007b, Shen and Chou, 2007c, Shen and Chou, 2007d, Shen et al., 2007), enzyme functional class (Chou and Cai, 2004a, Chou and Cai, 2004c), membrane protein types (Chou and Cai, 2005), protease types (Zhou and Cai, 2006), and protein–protein interactions (Chou and Cai, 2006).

Sequential evolution mode

Biology is a natural science with historic dimension. All biological species have developed continuously starting out from a very limited number of ancestral species. The same is true for protein sequences (Chou, 2004b). Their evolution involves changes of single residues, insertions and deletions of several residues (Chou, 1995b), gene doubling, and gene fusion. With these changes accumulated for a long period of time, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing in the same subcellular location. The general formulation of PseAAC can be used to incorporate this kind of information via its sequential evolution mode, i.e.

$$\mathbf{P} = [\psi_1^{\lambda} \;\; \psi_2^{\lambda} \;\; \cdots \;\; \psi_j^{\lambda} \;\; \cdots \;\; \psi_{20}^{\lambda}]^{\mathrm{T}} \tag{13}$$

where the components $\psi_j^{\lambda}$ ($j=1, 2, \ldots, 20$; $\lambda < L$) are given by Eq. (14), in which $\lambda$ is an uncertain number that will be further discussed later, $L$ is the length of $\mathbf{P}$ (counted in the total number of its constituent amino acids), and $E_{i \to j}$ represents the score of the amino acid residue in the $i$-th position of the protein sequence being changed to amino acid type $j$ during the evolutionary process (Schaffer et al., 2001), which can be derived by using PSI-BLAST (Schaffer et al., 2001) to search the Swiss-Prot database as described in Chou and Shen (2010c). Here, the numerical codes 1, 2,…,20 are used to denote the 20 native amino acid types according to the alphabetical order of their single character codes. The above equations were used to identify membrane proteins and their types (Chou and Shen, 2007d), enzymes and their functional classes (Shen and Chou, 2007a), proteases and their types (Chou and Shen, 2008a), protein quaternary structural attributes (Shen and Chou, 2009b), as well as protein subcellular localization (Chou and Shen, 2010a, Chou and Shen, 2010b). Besides the aforementioned PseAAC modes, there may be some other feature extraction methods to represent protein samples, but they can always be formulated with the form of Eq. (6), the general formulation of PseAAC.

It is instructive to point out that, regardless of which kind of PseAAC mode is adopted for protein samples, the query proteins and the proteins used to train the prediction engine must be defined in the same infrastructural frame with exactly the same dimension. For instance, if a query protein is defined in the 17402-D FunD space (see Eq. (9)), then the prediction should be carried out based on those proteins in the training set that can be defined in exactly the same 17402-D FunD space as well. If a query protein is defined in the 60020-D GO space (see Eq. (11)), then the prediction should be carried out based on those proteins in the training set that can be defined in exactly the same 60020-D GO space as well. If the query protein is a naught vector in both the 17402-D FunD space and the 60020-D GO space, and hence must be defined instead in the sequential evolution space (see Eq. (13)), then all the proteins used to train the prediction engine must also be formulated in the same sequential evolution space. It is particularly important to follow such a self-consistency principle when hybridizing different PseAAC modes or building an ensemble classifier by fusing many individual classifiers (Chou and Shen, 2006d).
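For readers who want a feel for how a λ-dependent, 20-component evolution feature can be computed from position-specific scores, the Python sketch below may help. The particular formula is an assumption made for illustration, in the style of pseudo position-specific scoring matrix features, and may differ from the author's Eq. (14): for λ=0 it takes the column mean of the scores, and for λ>0 the mean squared difference between scores λ positions apart. The input `pssm` is assumed to be a list of L rows, each holding 20 scores E_{i→j} already obtained from a tool such as PSI-BLAST; the function name is hypothetical.

```python
def evolution_features(pssm, lam):
    """Return 20 features from an L x 20 score matrix (illustrative formula).

    lam == 0: average score of each amino acid type over all positions.
    lam  > 0: mean squared difference between the scores of positions
              i and i + lam, for each amino acid type.
    """
    L = len(pssm)
    if not 0 <= lam < L:
        raise ValueError("lam must satisfy 0 <= lam < L")
    if lam == 0:
        return [sum(row[j] for row in pssm) / L for j in range(20)]
    return [
        sum((pssm[i][j] - pssm[i + lam][j]) ** 2 for i in range(L - lam))
        / (L - lam)
        for j in range(20)
    ]
```

Whatever the exact formula, the output has the same dimension for every protein and every admissible λ, which is what the self-consistency principle above requires of query and training samples alike.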

Prediction algorithm (operating engine)

The problem of predicting protein attributes can be generally described as follows. Suppose a system containing $N$ proteins ($\mathbf{P}_1, \mathbf{P}_2, \ldots, \mathbf{P}_N$), which have been classified into $M$ subsets (categories) as formulated by Eq. (1), where each subset $\mathbb{S}_m$ ($m=1, 2, \ldots, M$) is composed of proteins with the same attribute category and its size (the number of proteins therein) is $N_m$. Obviously, we have $N = N_1 + N_2 + \cdots + N_M$. According to Eq. (6), we can suppose without losing generality that the $k$-th protein in the subset $\mathbb{S}_m$ (see Eq. (1)) is expressed by

$$\mathbf{P}_k^m = [\psi_{k,1}^m \;\; \psi_{k,2}^m \;\; \cdots \;\; \psi_{k,j}^m \;\; \cdots \;\; \psi_{k,\Omega}^m]^{\mathrm{T}} \tag{15}$$

where $\psi_{k,j}^m$ is the $j$-th component of the $k$-th protein in $\mathbb{S}_m$. Now, for a query protein $\mathbf{P}$ as defined by Eq. (6), how can we identify which subset it belongs to? Many different prediction algorithms have been introduced to address this problem, such as the discriminant algorithm (Chou and Maggiora, 1998, Chou and Elrod, 1999), neural network algorithm (Cai et al., 2000, Cai et al., 2001), support vector machine (SVM) (Cai et al., 2003, Cai et al., 2004, Chou and Cai, 2002), and K-nearest neighbor algorithm (Cai and Chou, 2003, Chou and Shen, 2006b). In this paper we shall focus on the K-nearest neighbor algorithm (Denoeux, 1995) and show how to generate a powerful ensemble classifier by fusing many individual basic classifiers characterized with different control parameters. The K-nearest neighbor (KNN) classifier is quite popular in the pattern recognition community owing to its good performance and ease of use. According to the KNN rule (Denoeux, 1995, Keller et al., 1985), also called the “voting KNN rule”, the query protein should be assigned to the subset represented by a majority of its K nearest neighbors, as illustrated in Fig. 5.

Fig. 5

Illustration to show how the KNN classifier depends on the selection of parameter K in identifying the attribute category of a query protein, where the query protein $\mathbf{P}$ is represented by the character q with a filled circle, proteins belonging to subset $\mathbb{S}_1$ (category 1) are represented by the open circle with number 1, proteins of $\mathbb{S}_2$ by the open circle with number 2, and so forth. When K=1, the query protein is predicted belonging to category 2 as its nearest protein does; when K=3, the query protein is predicted belonging to category 3 because two of its three nearest proteins belong to that category; when K=9, the query protein is predicted belonging to category 2 again because the majority of its nine nearest proteins belong to category 2.

There are many different definitions to measure the “nearness” for the KNN classifier, such as Euclidean distance, Hamming distance (Mardia et al., 1979), and Mahalanobis distance (Chou, 1995a, Mahalanobis, 1936, Pillai, 1985). Usually, the following equation was adopted to measure the nearness between proteins $\mathbf{P}$ and $\mathbf{P}_k^m$ (cf. Eqs. (6), (15)):

$$D(\mathbf{P}, \mathbf{P}_k^m) = 1 - \frac{\mathbf{P} \cdot \mathbf{P}_k^m}{\|\mathbf{P}\| \, \|\mathbf{P}_k^m\|} \tag{16}$$

where $\mathbf{P} \cdot \mathbf{P}_k^m$ is the dot product of the two vectors, and $\|\mathbf{P}\|$ and $\|\mathbf{P}_k^m\|$ their modulus, respectively. According to Eq. (16), when $\mathbf{P} \equiv \mathbf{P}_k^m$ we have $D(\mathbf{P}, \mathbf{P}_k^m) = 0$, indicating the “distance” between these two proteins is zero and hence they have perfect or 100% similarity. In using the KNN rule, the predicted result will depend on the selection of the parameter K, the number of the nearest neighbors to the query protein $\mathbf{P}$, as described below.
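The nearness measure described above, one minus the normalized dot product of two feature vectors, can be written directly in Python. This is a minimal sketch using plain lists; the function name `distance` is illustrative, and rejecting naught vectors is a choice made here because the measure is undefined for a zero modulus.

```python
from math import sqrt


def distance(p, q):
    """One minus the cosine of the angle between feature vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    if norm == 0:
        raise ValueError("distance is undefined for a naught vector")
    return 1.0 - dot / norm
```

Identical vectors give a distance of zero, orthogonal vectors give one, and the measure is unchanged when either vector is rescaled, so only the direction of the feature vector matters.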

Nearest neighbor classifier

The nearest neighbor classifier (Cover and Hart, 1967), also called NN classifier, is a special case of KNN classifier with K=1 (Fig. 5). With the NN classifier, the protein $\mathbf{P}$ will be predicted belonging to the same attribute category as the protein in the learning dataset that has the shortest “distance” to $\mathbf{P}$, i.e., the query protein will be classified in the μ-th attribute category if

$$\mu = \arg\min_{m} \left\{ \min_{k} D(\mathbf{P}, \mathbf{P}_k^m) \right\} \qquad (m = 1, 2, \ldots, M; \; k = 1, 2, \ldots, N_m) \tag{17}$$

where $\min_{k} D(\mathbf{P}, \mathbf{P}_k^m)$ means taking the minimum value of $D(\mathbf{P}, \mathbf{P}_k^m)$ for the proteins in the subset $\mathbb{S}_m$ (cf. Eqs. (1) and (16)), and the operator arg min means taking the argument of $m$ that minimizes the quantity right after the operator. In other words, μ in Eq. (17) is equal to the argument of $m$ that minimizes $\min_{k} D(\mathbf{P}, \mathbf{P}_k^m)$. If there are two or more arguments leading to the same minimum value, the query protein will be randomly assigned to one of the subsets associated with these arguments, although this kind of tie case rarely happens. Owing to its simplicity and apparent efficiency, the NN classifier is still a favorite method used by many investigators (see, e.g., Chen et al., 2010, He et al., 2010, Huang et al., 2010).

KNN classifier

With the KNN classifier when K>1, the attribute of the query protein $\mathbf{P}$ will be determined by the majority of its K nearest neighbors via a vote (Fig. 5), as can be formulated as follows. Suppose $\mathbf{P}_1^{*}, \mathbf{P}_2^{*}, \ldots, \mathbf{P}_K^{*}$ are the K proteins in $\mathbb{S}$ that have the closest distances to $\mathbf{P}$; the query protein will be predicted belonging to the μ-th subset (attribute category) if

$$\mu = \arg\max_{m} \left\{ \sum_{i=1}^{K} \Delta(\mathbf{P}_i^{*}, \mathbb{S}_m) \right\} \qquad (m = 1, 2, \ldots, M) \tag{18}$$

where μ is the argument of $m$ that maximizes $\sum_{i=1}^{K} \Delta(\mathbf{P}_i^{*}, \mathbb{S}_m)$ and

$$\Delta(\mathbf{P}_i^{*}, \mathbb{S}_m) = \begin{cases} 1, & \text{if } \mathbf{P}_i^{*} \in \mathbb{S}_m \\ 0, & \text{otherwise} \end{cases} \tag{19}$$

where ∈ is a symbol in the set theory meaning “member of”. If there is a tie for the voting results, the query protein will be randomly assigned to one of the subsets associated with the tie case. Generally speaking, the greater the K (the number of the nearest neighbors counted), the less likely the tie case occurs.

As mentioned above, the sequential evolution PseAAC mode of Eq. (13) contains a parameter λ, which is associated with what tier of sequence correlation is taken into account for the PseAAC. As we can see from Eq. (14), the only constraint to λ is that it must be smaller than $L$, the number of the amino acids in the protein concerned. Suppose the length of the shortest protein investigated is 50, then λ can be any of the following 50 numbers: 0, 1, 2,…,49. Although in principle we can include all these possibilities for λ by enlarging the dimension of the PseAAC to contain 20×50=1000 components, it may cause various unfavorable problems for statistical prediction, such as “high dimension disaster” and “overfitting redundancy” (Wang et al., 2008a). Actually, it may reduce the cluster-tolerant capacity (Chou, 1999) and lower the success rate of cross-validation if the PseAAC contains too many trivial components. Accordingly, for a given training dataset, there is an optimal number for λ. However, it would be time-consuming and tedious to find the optimal λ by changing its value and doing tests one-by-one. Likewise, the KNN classifier (cf. Eq. (18)) also contains a parameter K, the number of the nearest neighbors to a query protein (Fig. 5). Choosing a different value for K will affect the predicted result. In other words, for a given training dataset, there is an optimal value for K as well. The parameters such as λ and K are called uncertain parameters. The number of the uncertain parameters depends on which model is used to represent the protein samples and what classifier is used for the prediction engine. It can be seen from Eqs. (9), (11), (13), and (18) that one uncertain parameter, K, needs to be determined if using the KNN classifier based on the FunD (or GO) mode of PseAAC, and that two uncertain parameters, K and λ, need to be determined if using the KNN classifier based on the sequential evolution mode. It would be much more tedious and time-consuming to determine the optimal values for two uncertain parameters. To deal with this kind of uncertain parameters, let us introduce the fusion approach.
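The voting rule can be sketched in Python as follows, with the NN classifier recovered by setting `k=1`. The sketch assumes the training set is a list of `(vector, label)` pairs and uses the one-minus-cosine nearness described earlier. One deliberate deviation, made only so that runs are reproducible: where the text assigns ties at random, this sketch picks the smallest tied label. All names are hypothetical.

```python
from collections import Counter
from math import sqrt


def distance(p, q):
    """One minus the cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return 1.0 - dot / (sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q)))


def knn_predict(query, training, k):
    """Assign the query to the category holding the majority among its
    k nearest training proteins; training is a list of (vector, label)."""
    ranked = sorted(training, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    best = max(votes.values())
    # Deterministic tie-break (assumption): smallest tied label wins.
    return min(label for label, n in votes.items() if n == best)


# Toy data arranged so that the answer changes with k, as in Fig. 5.
training = [
    ([1.0, 0.05], 2), ([1.0, 0.2], 3), ([1.0, 0.3], 3),
    ([1.0, 1.0], 2), ([0.0, 1.0], 2),
]
query = [1.0, 0.0]
```

On this toy data the prediction is category 2 for `k=1`, category 3 for `k=3`, and category 2 again for `k=5`, illustrating why K is called an uncertain parameter.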

One-dimensional fusion

In most applications of the KNN classifier to protein attribute prediction, the success rate decreases remarkably when K>20. Therefore, the basic individual classifiers to be considered can be generally expressed as

$$\mathbb{C}(K) \triangleright \mathbf{P} \in \mathbb{S}, \quad K = 1, 2, \ldots, 20 \tag{20}$$

where $\mathbb{C}(K)$ represents the KNN classifier that is a function of K, and the symbol $\triangleright$ is the identification operator, meaning that $\mathbb{C}(K)$ is used to identify the attribute of the query protein P among the M subsets of $\mathbb{S}$ in Eq. (1). Suppose the accumulated score thus obtained (with K=1,2,…,20) for the protein P belonging to the m-th subset is given by

$$Y_m = \sum_{K=1}^{20} \Delta(K, \mathbb{S}_m), \quad m = 1, 2, \ldots, M \tag{21}$$

where

$$\Delta(K, \mathbb{S}_m) = \begin{cases} 1, & \text{if } \mathbb{C}(K) \text{ identifies } \mathbf{P} \text{ as belonging to } \mathbb{S}_m \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

Thus the query protein P is predicted as belonging to the subset for which its score of Eq. (21) is the highest, i.e., the query protein P is identified as belonging to the μ-th subset if

$$\mu = \arg\max_{m} \{ Y_m \} \tag{23}$$

where μ is the argument of m that maximizes the score function of Eq. (21). If two or more arguments lead to the same maximum value, the query protein is randomly assigned to one of the subsets associated with these arguments, although this kind of tie rarely happens.
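The one-dimensional fusion can be sketched as a loop over K that accumulates one vote per individual classifier, in the spirit of Eq. (21). The names `knn_predict` and `fuse_over_k`, the precomputed distance list, and the fixed random seed are illustrative assumptions, not details given in the text.

```python
import random
from collections import Counter


def knn_predict(distances, labels, k, rng):
    """Single KNN classifier: majority vote, ties broken at random."""
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    top = max(votes.values())
    return rng.choice(sorted(m for m, v in votes.items() if v == top))


def fuse_over_k(distances, labels, k_values=range(1, 21), seed=0):
    """Fuse the individual KNN classifiers over all K in k_values.

    Each classifier contributes one vote to the subset it predicts;
    the subset with the highest accumulated score wins.
    """
    rng = random.Random(seed)
    scores = Counter()
    for k in k_values:
        if k > len(labels):
            break  # assumption: skip K larger than the training set
        scores[knn_predict(distances, labels, k, rng)] += 1
    top = max(scores.values())
    winners = sorted(m for m, s in scores.items() if s == top)
    # Random assignment among tied subsets, as in the single-classifier case.
    return rng.choice(winners), dict(scores)


toy_distances = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_labels = ["A", "B", "B", "B", "A"]
fused, fused_scores = fuse_over_k(toy_distances, toy_labels, k_values=range(1, 6))
```

Because every K contributes to the final decision, no single value of K has to be selected in advance, which is the point of the fusion approach.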

Two-dimensional fusion

When the KNN classifier is operated on the query protein formulated with the sequential evolution mode (cf. Eq. (13)), we face a problem with two uncertain parameters, K and λ. In general, the shortest protein sequence investigated is 50 amino acids long (Chou and Shen, 2008a, Chou and Shen, 2010c); hence the maximum value allowed for λ can be set to 49. Thus, the basic individual classifiers to be considered become

$$\mathbb{C}(K, \lambda) \triangleright \mathbf{P} \in \mathbb{S}, \quad K = 1, 2, \ldots, 20; \; \lambda = 0, 1, \ldots, 49 \tag{24}$$

and the corresponding accumulated score for the query protein belonging to the m-th subset is given by

$$Y_m = \sum_{K=1}^{20} \sum_{\lambda=0}^{49} \Delta(K, \lambda, \mathbb{S}_m), \quad m = 1, 2, \ldots, M \tag{25}$$

where

$$\Delta(K, \lambda, \mathbb{S}_m) = \begin{cases} 1, & \text{if } \mathbb{C}(K, \lambda) \text{ identifies } \mathbf{P} \text{ as belonging to } \mathbb{S}_m \\ 0, & \text{otherwise} \end{cases} \tag{26}$$

and the query protein is predicted as belonging to the subset for which its score of Eq. (25) is the highest, i.e., the query protein P is identified as belonging to the μ-th subset if

$$\mu = \arg\max_{m} \{ Y_m \} \tag{27}$$

where μ is the argument of m that maximizes the score function of Eq. (25). If two or more arguments lead to the same maximum value, the query protein is randomly assigned to one of the subsets associated with these arguments, although this kind of tie rarely happens. If a basic individual classifier involves three or more uncertain parameters, three- or higher-dimensional fusion can be performed by following procedures similar to those described above.
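A minimal sketch of the two-dimensional fusion follows. It assumes a hypothetical callback `distances_for(lam)` that returns the distances between the query and the training proteins when both are represented with the PseAAC for that λ; this callback, the function names, and the deterministic tie-break are illustrative assumptions and not part of the original description, which breaks ties at random.

```python
from collections import Counter


def knn_predict(distances, labels, k):
    """Single KNN classifier by majority vote.

    Assumption for reproducibility: a tie goes to the subset whose
    member is encountered first among the nearest neighbors, whereas
    the text assigns tied cases at random.
    """
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def fuse_over_k_and_lambda(distances_for, labels, k_values, lambda_values):
    """Accumulate one vote per (K, lambda) classifier and pick the top subset."""
    scores = Counter()
    for lam in lambda_values:
        distances = distances_for(lam)  # representation changes with lambda
        for k in k_values:
            scores[knn_predict(distances, labels, k)] += 1
    return scores.most_common(1)[0][0], dict(scores)


def toy_distances_for(lam):
    # Hypothetical distances: the first training protein drifts away
    # from the query as lambda grows.
    return [0.1 + lam, 0.5, 0.6, 0.7]


toy_labels = ["A", "B", "B", "A"]
fused_2d, scores_2d = fuse_over_k_and_lambda(
    toy_distances_for, toy_labels, k_values=range(1, 4), lambda_values=range(0, 2)
)
```

With K=1,…,20 and λ=0,…,49 as in Eq. (24), the same loop would combine 20×50=1000 individual classifiers.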

Cross-validation test

After a prediction method has been developed, a natural question to ask is: What is its accuracy? In statistical prediction, it is meaningless to state the success rate of a predictor without specifying which cross-validation method and benchmark dataset were used to test its accuracy. In the literature, the following three cross-validation methods are generally used for examining the effectiveness of a statistical prediction method: (1) the independent dataset test, (2) the subsampling (Γ-fold, such as 5- or 10-fold cross-validation) test, and (3) the jackknife test (Chou and Zhang, 1995).

For the independent dataset test, although all the proteins used to test the predictor are outside the training dataset used to train it, so as to exclude the “memory” effect or bias, the way the independent proteins are selected could be quite arbitrary unless the number of independent proteins is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than another predictor on a given independent testing dataset might fail to do so when tested on another independent testing dataset (Chou and Zhang, 1995). Accordingly, the independent dataset test is not a fairly objective test method, although it has often been used to demonstrate the practical application of a predictor (see, e.g., Cedano et al., 1997, Chou and Elrod, 1999, Chou and Shen, 2006c, Chou and Shen, 2007a).

For the subsampling test, the concrete procedure usually used in the literature is 5-fold, 7-fold, or 10-fold cross-validation. The problem with the Γ-fold cross-validation test is that the number of possible selections in dividing a benchmark dataset is an astronomical figure even for a very simple dataset. This is because, for a benchmark dataset as formulated in Eq. (1), the number of possible combinations of taking one Γ-th (i.e., 1/Γ) of the proteins from each of the subsets in Eq. (1) is

$$\Pi = \prod_{m=1}^{M} \Omega_m \tag{28}$$

where

$$\Omega_m = \frac{N_m!}{\left( N_m - \mathrm{Int}[N_m/\Gamma] \right)! \; \left( \mathrm{Int}[N_m/\Gamma] \right)!} \tag{29}$$

where $N_m$ is the number of proteins in the m-th subset $\mathbb{S}_m$, and the symbol Int is the integer-truncating operator, meaning to take the integer part of the number in the brackets right after it. For example, without loss of generality, let us consider the case of 5-fold cross-validation (i.e., Γ=5) for a very simple benchmark dataset that contains 250 proteins, of which $N_1=65$ belong to subset $\mathbb{S}_1$, $N_2=60$ to subset $\mathbb{S}_2$, $N_3=55$ to subset $\mathbb{S}_3$, and $N_4=70$ to subset $\mathbb{S}_4$. Substituting these figures into Eqs. (28), (29), we find that the number of possible combinations of taking one-fifth of the proteins from each of the four subsets is

$$\Pi = \frac{65!}{52!\,13!} \times \frac{60!}{48!\,12!} \times \frac{55!}{44!\,11!} \times \frac{70!}{56!\,14!} \approx 5.3 \times 10^{50}$$

indicating that even for such a simple and small benchmark dataset, the number of possible selections for 5-fold cross-validation is astronomical. Now let us consider a moderate-size dataset that consists of 640 proteins classified into M=8 subsets, each containing 80 proteins, i.e., $N_1=N_2=\cdots=N_8=80$. According to Eqs. (28), (29), the number of possible combinations of taking one-fifth of the proteins from each of the 8 subsets for 5-fold cross-validation is

$$\Pi = \left( \frac{80!}{64!\,16!} \right)^{8} \approx 2.8 \times 10^{131}$$

If the above benchmark dataset is slightly larger and more complicated, i.e., the number of proteins is increased from 640 to 800 and the number of subsets from 8 to 10, with each still containing 80 proteins, then the number of possible combinations of taking one-fifth of the proteins from each of the 10 subsets for 5-fold cross-validation is

$$\Pi = \left( \frac{80!}{64!\,16!} \right)^{10} \approx 2.0 \times 10^{164}$$

In practice, many typical benchmark datasets contain more than 1000 proteins (see, e.g., Chou and Shen, 2008a, Chou and Shen, 2010a, Chou and Shen, 2010c). Therefore, in any actual subsampling cross-validation test, only an extremely small fraction of the possible selections is taken into account.
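The counts implied by Eqs. (28) and (29) can be checked with exact integer arithmetic. The sketch below uses Python's `math.comb`; the function name `count_selections` is a hypothetical helper introduced here for illustration.

```python
from math import comb


def count_selections(subset_sizes, folds):
    """Number of ways to draw Int[N_m / folds] proteins from every subset.

    subset_sizes: list of N_m, the number of proteins in each subset
    folds:        the value of Gamma in Gamma-fold cross-validation
    """
    total = 1
    for n in subset_sizes:
        # Integer division plays the role of the Int[...] truncation.
        total *= comb(n, n // folds)
    return total


small = count_selections([65, 60, 55, 70], folds=5)
moderate = count_selections([80] * 8, folds=5)
larger = count_selections([80] * 10, folds=5)

# Report only the order of magnitude of each count.
for name, value in [("small", small), ("moderate", moderate), ("larger", larger)]:
    print(name, "has", len(str(value)), "decimal digits")
```

Counting decimal digits avoids floating-point overflow and makes the growth with the number of subsets easy to see.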
Since different selections will always lead to different results even for the same benchmark dataset and the same predictor, the subsampling test (such as 5-fold cross-validation) cannot avoid arbitrariness either. A test method unable to yield a unique outcome cannot be deemed an ideal one.

In the jackknife test, all the proteins in the benchmark dataset are singled out one by one and tested by the predictor trained on the remaining protein samples. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and each protein sample is in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem mentioned above for the independent dataset test and the subsampling test can be avoided, because the outcome obtained by jackknife cross-validation is always unique for a given benchmark dataset. As for the possible overestimation of the success rate by the jackknife test because only one sample is singled out at a time for testing, the answer is that as long as the jackknife test is performed on a stringent benchmark dataset in which none of the proteins has ≥25% pairwise sequence identity to any other in the same subset, such as those mentioned in Section 2, it is highly unlikely to yield an overestimated rate compared with the actual success rate in practical applications, as demonstrated in Chou and Shen (2010c) and Shen and Chou (2010). Besides, when the jackknife test is used to compare two predictors, even if there is some overestimation due to using a less stringent benchmark dataset for one predictor, the same overestimation would exist for the other as long as both are tested on the same dataset. Accordingly, the jackknife test has been increasingly and widely used by investigators to examine the quality of various predictors (see, e.g., Anand and Suganthan, 2009, Cai et al., 2010, Chen et al., 2008a, Chen et al., 2008b, Chen and Han, 2009, Du and Li, 2008, Du et al., 2009, Fang et al., 2008, Feng and Luo, 2008, Gu and Chen, 2009, Gu et al., 2010a, Jahandideh et al., 2007a, Jahandideh et al., 2007b, Jahandideh et al., 2009, Ji et al., 2010, Kannan et al., 2008, Li et al., 2009, Lin, 2008, Lin et al., 2009a, Liu et al., 2010a, Munteanu et al., 2008, Nanni and Lumini, 2008, Nanni and Lumini, 2009, Rezaei et al., 2008, Shao et al., 2009, Shi et al., 2008, Shi and Hu, 2010, Vilar et al., 2009, Wang and Yang, 2010, Wang et al., 2010a, Wang et al., 2008b, Yang and Jiang, 2010, Yang et al., 2009, Yang et al., 2010, Zhao et al., 2008, Zhou et al., 2008).

However, even when the jackknife approach is used for cross-validation, the same predictor may still generate obviously different success rates when tested on different benchmark datasets. This is because the more stringent a benchmark dataset is in excluding homologous and highly similar sequences, the more difficult it is for a predictor to achieve a high overall success rate (Chou and Shen, 2010a). Also, the more subsets (attribute categories) a benchmark dataset covers, the more difficult it is to achieve a high overall success rate. This can be easily seen from the following consideration. Suppose a benchmark dataset consists of two subsets (attribute categories), each containing the same number of proteins. The overall success rate in identifying their attribute categories by random assignment would be 1/2=50%. However, for a benchmark dataset consisting of 20 equally sized subsets, the corresponding overall success rate by random assignment would be 1/20=5%, which is only one-tenth of the former.
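The jackknife procedure described above can be sketched in a few lines of Python. The helper names `jackknife_success_rate` and `nearest_neighbor`, the toy feature vectors, and the choice of a 1-NN predictor with squared Euclidean distance are assumptions made for illustration; any predictor with the same calling convention could be substituted.

```python
def jackknife_success_rate(samples, labels, predict):
    """Leave each sample out in turn, train on the rest, and test on it.

    predict(train_samples, train_labels, query) must return a label.
    For a deterministic predictor the outcome is unique for a given
    benchmark dataset, since no random partitioning is involved.
    """
    hits = 0
    for i in range(len(samples)):
        # Training set: every sample except the i-th one.
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if predict(train_samples, train_labels, samples[i]) == labels[i]:
            hits += 1
    return hits / len(samples)


def nearest_neighbor(train_samples, train_labels, query):
    """Toy 1-NN predictor using squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(s, query)) for s in train_samples]
    return train_labels[dists.index(min(dists))]


toy_samples = [(0.0,), (0.1,), (1.0,), (1.1,)]
toy_labels = [0, 0, 1, 1]
rate = jackknife_success_rate(toy_samples, toy_labels, nearest_neighbor)
```

Running the function twice on the same data returns the same rate, which is the uniqueness property that distinguishes the jackknife test from subsampling.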

Web-server

Even if a powerful predictor has been developed by accomplishing the above four procedures, namely constructing a valid benchmark dataset, formulating protein samples with PseAAC to successfully catch their essential and core features, introducing a powerful and efficient algorithm or engine to operate the prediction, and achieving a high overall success rate by jackknife test on a stringent dataset in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (attribute category), it does not mean that the predictor is really complete. This is because we are living in the Internet Age. To make a new prediction method really useful for the majority of people, it is important, and indeed necessary, to provide a user-friendly and publicly accessible web-server for the method (Chou and Shen, 2009). Technically speaking, a web-server is a computer program that is responsible for accepting Hypertext Transfer Protocol (HTTP) requests from clients. By means of web-servers, many computational prediction methods, regardless of how difficult their mathematics or how complicated their algorithms are, can be easily used by the vast majority of scientists to generate their desired data without the need to understand the mathematical details.

Conclusion and perspectives

In order to utilize in a timely manner the huge number of newly discovered protein sequences generated in the postgenomic era for basic research and drug development, scientists are anxious to know their biological attributes. Many studies from various research laboratories around the world have indicated that mathematical analysis, computational modeling, and the introduction of novel physical concepts to biology and medicine, such as graphical analysis (Andraos, 2008, Myers and Palmer, 1985, Zhou and Deng, 1984), modeling three-dimensional structures of targeted proteins/peptides for drug design (Sharma et al., 2008, Zhou and Troy, 2003, Zhou and Troy, 2005a, Zhou and Troy, 2005b, Zhou et al., 2004), diffusion-controlled reaction simulation (Zhou et al., 1981, Zhou and Zhong, 1982, Zhou et al., 1983), cellular responding kinetics (Qi et al., 2007), and biological functions of solitons in DNA (Zhou, 1989), can provide useful insights for both basic research and drug design and hence are widely welcomed by the scientific community. In view of this, it is highly desirable to develop automated methods, by introducing new concepts and approaches, for quickly and accurately predicting the attributes of uncharacterized proteins based on their sequence information alone. During the past two decades or so, many statistical methods for predicting various protein attributes have been proposed. In this review, the key steps for establishing a powerful predictor in this regard have been analyzed in the hope that the points raised here may help stimulate the further development of new and more powerful predictors in this area. It is anticipated that the general form of PseAAC as formulated in this review may further stimulate efforts to find various new modes of optimal PseAAC, which is one of the most important future directions to focus on in order to substantially improve the power of predicting protein attributes.

200 in total

1. Using fourier spectrum analysis and pseudo amino acid composition for prediction of membrane protein types.

Authors: Hui Liu; Jie Yang; Meng Wang; Li Xue; Kuo-Chen Chou
Journal: Protein J Date: 2005-08 Impact factor: 2.371

2. Amino Acid Principal Component Analysis (AAPCA) and its applications in protein structural class prediction.

Authors: Qi-Shi Du; Zhi-Qin Jiang; Wen-Zhang He; Da-Peng Li; Kuo-Chen Chou
Journal: J Biomol Struct Dyn Date: 2006-06

Review 3. Recent progress in protein subcellular location prediction.

Authors: Kuo-Chen Chou; Hong-Bin Shen
Journal: Anal Biochem Date: 2007-07-12 Impact factor: 3.365

4. Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach.

Authors: Yu-hong Zeng; Yan-zhi Guo; Rong-quan Xiao; Li Yang; Le-zheng Yu; Meng-long Li
Journal: J Theor Biol Date: 2009-03-31 Impact factor: 2.691

5. A classification-based prediction model of messenger RNA polyadenylation sites.

Authors: Guoli Ji; Xiaohui Wu; Yingjia Shen; Jiangyin Huang; Qingshun Quinn Li
Journal: J Theor Biol Date: 2010-05-26 Impact factor: 2.691

6. SCOP: a structural classification of proteins database for the investigation of sequences and structures.

Authors: A G Murzin; S E Brenner; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1995-04-07 Impact factor: 5.469

7. Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks.

Authors: Tao Huang; Xiao-He Shi; Ping Wang; Zhisong He; Kai-Yan Feng; Lele Hu; Xiangyin Kong; Yi-Xue Li; Yu-Dong Cai; Kuo-Chen Chou
Journal: PLoS One Date: 2010-06-04 Impact factor: 3.240

8. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition.

Authors: Hao Lin
Journal: J Theor Biol Date: 2008-02-12 Impact factor: 2.691

9. Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates.

Authors: Ashish Anand; P N Suganthan
Journal: J Theor Biol Date: 2009-05-03 Impact factor: 2.691

10. Structure and mechanism of the M2 proton channel of influenza A virus.

Authors: Jason R Schnell; James J Chou
Journal: Nature Date: 2008-01-31 Impact factor: 49.962
