Kuo-Chen Chou
Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA. kcchou@gordonlifescience.org
Abstract
With the completion of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace of determining their biological attributes is much slower. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. This imbalance, which has critically limited our ability to make timely use of newly discovered proteins for basic research and drug development, calls for computational methods or high-throughput automated tools that can quickly and reliably identify various attributes of uncharacterized proteins from their sequence information alone. During the last two decades or so, many such methods have been established in the hope of bridging this gap. Developing these methods usually requires considering the following: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications, and its recent development, particularly how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
With the explosive growth of protein sequences generated in the postgenomic age, scientists are anxious to know their attributes because they are closely correlated with the structures and functions of the proteins as well as their roles in biological processes, and hence are very important to both basic research and drug target development. For instance, given an uncharacterized protein sequence, what is its folding rate? To which structural class and quaternary structural attribute does it belong? In which subcellular location does it reside? Can it simultaneously exist in or move between two or more subcellular locations? How can we identify it as an enzyme or non-enzyme? If it is an enzyme, to which enzyme functional class does it belong? Is it a membrane protein or non-membrane protein? If the former, to which membrane protein type does it belong? Is it a protease? If it is, to which protease type does it belong? Is it a G protein-coupled receptor (GPCR)? If it is, to which GPCR type does it belong? Which part of the protein serves as its signal sequence? Where are its cleavage sites by proteases such as HIV (human immunodeficiency virus) protease and SARS (severe acute respiratory syndrome) enzyme? And so forth. Although the answers to these questions can be determined by conducting various biochemical experiments, relying on experimental approaches alone is both time-consuming and costly. As a consequence, the gap between the number of newly discovered protein sequences and the knowledge of their attributes continues to expand.
To bridge such a gap and acquire these kinds of information in a timely manner, scientists are challenged to develop computational methods for predicting various attributes of proteins based on their sequence information alone.
To establish a really useful predictor in this regard, one usually needs to accomplish the following procedures: (1) construct a valid benchmark dataset to train and test the predictor; (2) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the accuracy of the predictor; and (5) establish a user-friendly web-server for the predictor that is accessible to the public.
This review will discuss each of the above five procedures, with a special focus on procedure 2, particularly on how to use various different modes of pseudo amino acid composition to represent protein samples by incorporating their core and essential features.
Benchmark dataset
To develop a statistical prediction method for a given attribute, the first important step is to construct a benchmark dataset $\mathbb{S}$ according to its possible classification, i.e.,
$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \cdots \cup \mathbb{S}_M \tag{1}$$
where $\mathbb{S}_1$ represents the subset for category 1 of the attribute, $\mathbb{S}_2$ for category 2, and so forth; while $\cup$ represents the symbol for “union” in set theory, and $M$ the number of different categories for the attribute concerned. For example, when the attribute concerned was the protein structural classification as investigated in Chou (1995a), Chou and Zhang (1994), Chou (1989), Levitt and Chothia (1976), Nakashima et al. (1986) and Zhou (1998), M would be four as illustrated in Fig. 1; when the structural classification was defined according to the SCOP database (Murzin et al., 1995) or investigated in Chou and Cai (2004b), M would be seven as shown in Fig. 2; when the attribute was the membrane protein type as investigated in Chou and Shen (2007d), M would be eight as illustrated in Fig. 3; when the attribute was the subcellular localization of eukaryotic proteins as investigated in Chou and Shen (2010a), M would be 22 as illustrated in Fig. 4.
Fig. 1
Illustration to show the four categories of protein structural class: (a) all-α, (b) all-β, (c) α/β, and (d) α+β, where the α-helix is colored in red, β-strand in yellow, and the other in green. The PDB codes used to draw the representatives of the four structural classes are 1aep, 1gbg, 1enp, and 1aak, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 2
Illustration to show the seven categories of protein structural class: (a) all-α, (b) all-β, (c) α/β, (d) α+β, (e) μ (multi-domain), (f) σ (small protein), and (g) ρ (peptide), where the α-helix is colored in red, β-strand in yellow, and the other in green. The PDB codes used to draw the representatives of the seven structural classes are 1a6m, 1uzv, 2f62, 2bf5, 1vqq, 4hir, and 1ter, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 3
Schematic drawings to show the eight categories of membrane protein types: (1) type I transmembrane, (2) type II, (3) type III, (4) type IV, (5) multipass transmembrane, (6) lipid-chain-anchored membrane, (7) GPI-anchored membrane, and (8) peripheral membrane. As shown in the figure, types I, II, III, and IV are all of single-pass transmembrane proteins; see Spiess (1995) for a detailed description about their difference.
Fig. 4
Schematic illustration to show the 22 subcellular locations of eukaryotic proteins: (1) acrosome, (2) cell wall, (3) centriole, (4) chloroplast, (5) cyanelle, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracellular, (11) Golgi apparatus, (12) hydrogenosome, (13) lysosome, (14) melanosome, (15) microsome, (16) mitochondria, (17) nucleus, (18) peroxisome, (19) plasma membrane, (20) plastid, (21) spindle pole body, and (22) vacuole.
To avoid homology bias and redundancy, it is important to introduce a cutoff threshold when constructing a benchmark dataset.
Different cutoff threshold values have been used, such as 90% (Reinhardt and Hubbard, 1998), 80% (Small et al., 2004), 40% (Shen and Chou, 2007a), and 25% (Chou and Shen, 2010a, Chou and Shen, 2010c). When a benchmark dataset is constructed with a cutoff threshold of 25%, none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (category). Accordingly, the smaller the cutoff threshold, the more stringent the benchmark dataset is in excluding homology bias.
The benchmark datasets constructed in the earlier stage (see, e.g., Cedano et al., 1997, Chou, 1989, Nakashima et al., 1986) usually consisted of a learning (or training) dataset and an independent testing dataset, as can be formulated as
$$\mathbb{S} = \mathbb{S}^{L} \cup \mathbb{S}^{T}, \qquad \mathbb{S}^{L} \cap \mathbb{S}^{T} = \emptyset \tag{2}$$
where $\mathbb{S}^{L}$ is the learning dataset, $\mathbb{S}^{T}$ the testing dataset, $\emptyset$ the empty set, and $\cap$ the symbol for “intersection” in set theory. The learning dataset is used for training the predictor's “engine”, while the testing dataset is used for evaluating the predictor's accuracy via cross-validation. As we can see from Eq. (2), none of the proteins in the testing dataset $\mathbb{S}^{T}$ should occur in the learning dataset $\mathbb{S}^{L}$. Therefore, $\mathbb{S}^{T}$ is also called an independent dataset for performing cross-validation. However, as will be shown later, there is no need to artificially separate the benchmark dataset into a learning dataset and a testing dataset when the cross-validation is performed by the jackknife test, in which case one benchmark dataset can serve both the training and testing purposes.
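The role of the cutoff threshold can be illustrated with a minimal Python sketch. It assumes a greedy culling strategy and a toy identity measure, neither of which is taken from the works cited above; the names `toy_identity` and `cull_subset` are hypothetical. A real pipeline would compute pairwise identity from a proper sequence alignment.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def toy_identity(a: str, b: str) -> float:
    """Crude stand-in for pairwise sequence identity in [0, 1].

    Assumption: real identity values come from a sequence alignment;
    difflib is used here only to keep the sketch runnable.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cull_subset(sequences: List[str], cutoff: float = 0.25,
                identity: Callable[[str, str], float] = toy_identity) -> List[str]:
    """Greedily keep sequences whose identity to every kept one is < cutoff."""
    kept: List[str] = []
    for seq in sequences:
        # A sequence is accepted only if it stays below the cutoff
        # against all sequences already accepted into this subset.
        if all(identity(seq, other) < cutoff for other in kept):
            kept.append(seq)
    return kept


# One subset (category) with two near-identical members and one unrelated member.
subset = ["ACDEFGHIKL", "ACDEFGHIKM", "WWWWYYYYPP"]
print(cull_subset(subset, cutoff=0.25))
```

The culling is applied within each subset separately, in line with the statement that the threshold applies to proteins "in the same subset". Because the procedure is greedy, the result depends on the input order.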
Protein sample representation
Two kinds of models are usually used to represent protein samples: the sequential model and the discrete model. The most straightforward sequential model for a protein sample is its entire amino acid sequence, as expressed by
$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{3}$$
where $R_1$ represents the 1st residue of the protein $\mathbf{P}$, $R_2$ the 2nd residue, …, $R_L$ the $L$-th residue, and each belongs to one of the 20 native amino acid types. To get the desired results, sequence-similarity-search-based tools, such as BLAST (Altschul, 1997, Wootton and Federhen, 1993), are usually utilized to conduct the prediction. However, this kind of approach fails when the query protein does not have significant sequence similarity to any attribute-known protein. Thus, various non-sequential models, or discrete models, were proposed, as illustrated below.
The simplest discrete model used to represent a protein sample is its amino acid (AA) composition or AAC (Nakashima et al., 1986). According to the AAC-discrete model, the protein $\mathbf{P}$ of Eq. (3) can be expressed by (Chou, 1995a)
$$\mathbf{P} = [f_1 \; f_2 \; \cdots \; f_{20}]^{\mathbf{T}} \tag{4}$$
where $f_i$ ($i = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in $\mathbf{P}$, and $\mathbf{T}$ the transposing operator. Many methods for predicting various protein attributes were based on the AAC-discrete model (see, e.g., Cedano et al., 1997, Chou, 1999, Chou, 2000, Chou, 2005b, Chou and Zhang, 1992, Chou and Zhang, 1995, Chou and Maggiora, 1998, Chou and Elrod, 1999, Chou and Elrod, 2002, Chou et al., 1998, Chou, 1989, Du et al., 2006, Feng et al., 2005, Jahandideh et al., 2007a, Klein, 1986, Klein and Delisi, 1986, Liu and Chou, 1998, Metfessel et al., 1993, Nakashima and Nishikawa, 1994, Niu et al., 2006, Zhou, 1998, Zhou and Assa-Munt, 2001, Zhou and Doctor, 2003). However, as one can see from Eq. (4), all the sequence-order effects are lost in the AAC-discrete model, and hence the prediction quality thus obtained might be limited. This is the main shortcoming of the AAC-discrete model. To avoid completely losing the sequence-order information, a different discrete model, the so-called “pseudo amino acid composition” (PseAAC) model (Chou, 2001), was proposed to represent the sample of a protein, as formulated by
$$\mathbf{P} = [p_1 \; p_2 \; \cdots \; p_{20} \; p_{20+1} \; \cdots \; p_{20+\lambda}]^{\mathbf{T}} \tag{5}$$
where the first 20 elements are associated with the 20 elements in Eq. (4), i.e., the 20 amino acid components of the protein, while the additional $\lambda$ factors are used to incorporate some sequence-order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combination of other factors so long as they reflect some sort of sequence-order effect in one way or another.
For the convenience of users, a web-server called “PseAAC” (Shen and Chou, 2008) was established at http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/, by which some commonly used PseAAC forms can be automatically generated.
The concept of PseAAC has been widely used to study various problems in proteins and protein-related systems, such as predicting enzymes and their family/sub-family classification (Cai and Chou, 2005, Cai et al., 2005, Qiu et al., 2010, Wang et al., 2010b, Zhou et al., 2007), protein subcellular location prediction (Cai and Chou, 2003, Chou and Cai, 2003c, Chou and Cai, 2004e, Gao et al., 2005, Li and Li, 2008b, Pan et al., 2003, Shi et al., 2007, Shi et al., 2008, Xiao et al., 2006b, Zhang et al., 2008c), apoptosis protein subcellular location prediction (Chen and Li, 2007, Jiang et al., 2008b, Kandaswamy et al., 2010, Lin et al., 2009a, Liu et al., 2010b), mycobacterial protein subcellular location prediction (Lin et al., 2008), predicting protein subnuclear localization (Jiang et al., 2008a, Li and Li, 2008a, Shen and Chou, 2005b), predicting protein subchloroplast locations (Du et al., 2009), predicting protein submitochondria locations (Du and Li, 2006, Nanni and Lumini, 2008, Zeng et al., 2009), predicting membrane proteins and their types (Cai and Chou, 2006, Chou and Shen, 2007d, Liu et al., 2005, Shen and Chou, 2005a, Shen et al., 2006, Wang et al., 2004, Wang et al., 2006), discrimination of outer membrane proteins (Gao et al., 2010, Lin, 2008), identifying transmembrane regions in proteins (Diao et al., 2008), identifying proteases and their types (Chou and Shen, 2008a, Zhou and Cai, 2006), predicting protein solubility (Xiaohui et al., 2010), identifying GPCRs and their classes (Gu et al., 2010a, Gu et al., 2010b, Lin et al., 2009b, Qiu et al., 2009, Xiao et al., 2009b, Xiao et al., 2010b), prediction of nuclear receptors (Gao et al., 2009), prediction of cyclin proteins (Mohabatkar, 2010), identifying bacterial secreted proteins (Yu et al., 2010), identifying risk type of human papillomaviruses (Esmaeili et al., 2010), prediction of cell wall lytic enzymes (Ding et al., 2009), prediction of lipase types (Zhang et al., 2008a), predicting conotoxin superfamily and family (Lin and Li, 2007a, Mondal et al., 2006), predicting the cofactors of oxidoreductases (Zhang and Fang, 2008), predicting DNA-binding proteins (Fang et al., 2008), predicting protein structural classes (Chen et al., 2006a, Chen et al., 2006b, Ding et al., 2007, Li et al., 2009, Lin and Li, 2007b, Wu et al., 2010, Xiao et al., 2008a, Xiao et al., 2008b, Xiao et al., 2006a, Zhang and Ding, 2007, Zhang et al., 2008d), supersecondary structure prediction (Zou et al., 2011), protein secondary structure content prediction (Chen et al., 2009), predicting protein quaternary structural attributes (Chou and Cai, 2003a, Shen and Chou, 2009b, Xiao et al., 2009a, Xiao et al., 2010a, Zhang et al., 2008b, Zhang et al., 2006), fold pattern prediction (Shen and Chou, 2006, Shen and Chou, 2009a), and others (e.g., Georgiou et al., 2009).
Meanwhile, various modes of PseAAC that extract different features from protein sequences were proposed, including stochastic signal processing mode (Pan et al., 2003), Fourier spectrum analysis mode (Liu et al., 2005), special functions mode (Gao et al., 2005), complexity measure factor mode (Xiao et al., 2005, Xiao et al., 2006a), cellular automaton mode (Xiao et al., 2006b, Xiao et al., 2008b, Xiao et al., 2009b), geometric moments mode (Xiao et al., 2008b), gray dynamic mode (Xiao et al., 2008a), approximate entropy mode (Jiang et al., 2008a), continuous wavelet transform mode (Li et al., 2009), discrete wavelet transform mode (Qiu et al., 2009, Qiu et al., 2010), sequence-segmented mode (Zhang et al., 2008b), evolutionary information and von Neumann entropy mode (Zhang et al., 2008c), and so forth.
However, according to its original concept, the essence of PseAAC is to keep using a discrete model to represent a
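protein yet without completely losing its sequence-order information. Therefore, in a broad sense, the PseAAC of a protein is a set of discrete numbers that is derived from its amino acid sequence, differs from the classical AAC, and is able to harbor some sort of sequence-order or pattern information. Accordingly, the PseAAC for a protein $\mathbf{P}$ can be generally formulated as
$$\mathbf{P} = [\psi_1 \; \psi_2 \; \cdots \; \psi_u \; \cdots \; \psi_\Omega]^{\mathbf{T}} \tag{6}$$
where the subscript $\Omega$ is an integer, and its value and the components $\psi_1, \psi_2, \ldots$ depend on how the desired information is extracted from the amino acid sequence of $\mathbf{P}$ (cf. Eq. (3)). The form of Eq. (6) can cover all the aforementioned modes of PseAAC. For example, when
$$\Omega = 20 + \lambda, \qquad \psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{7}$$
we immediately obtain the formulation of PseAAC originally introduced in Chou (2001), where the meanings of $w$, $\theta_j$, and $\lambda$ were clearly elaborated and hence need not be repeated here. When
$$\Omega = 20 + 2\lambda, \qquad \psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & (20+1 \le u \le 20+2\lambda) \end{cases} \tag{8}$$
we obtain the formulation for the amphiphilic PseAAC (Chou, 2005a), where the meanings of $w$, $\tau_j$, and $\lambda$ were also clearly given.
It is instructive to point out that, with the general formulation of Eq. (6), the PseAAC can be used to reflect many more of the essential core features deeply hidden in complicated protein sequences through the following modes.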
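As a rough illustration of the AAC model and of the classical PseAAC of Chou (2001), the following Python sketch computes both vectors for a sequence made of the 20 native amino acid letters. Several choices are assumptions made only for illustration: a single physicochemical property (the Kyte-Doolittle hydropathy scale, standardized to zero mean and unit variance) stands in for whatever properties a given study uses, the correlation factor is taken as the mean squared property difference between residues a fixed number of positions apart, and `w = 0.05` is just a default weight. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# 20 native amino acids in alphabetical order of their single-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as a stand-in property.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(prop: Dict[str, float]) -> Dict[str, float]:
    """Rescale a property table to zero mean and unit standard deviation."""
    values = [prop[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (prop[a] - mean) / sd for a in AMINO_ACIDS}


def aac(seq: str) -> List[float]:
    """Amino acid composition: 20 normalized occurrence frequencies."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def pseaac(seq: str, lam: int = 3, w: float = 0.05,
           props: Sequence[Dict[str, float]] = (KYTE_DOOLITTLE,)) -> List[float]:
    """(20 + lam)-dimensional PseAAC vector in the style of Chou (2001)."""
    if not 0 <= lam < len(seq):
        raise ValueError("lam must satisfy 0 <= lam < len(seq)")
    tables = [standardize(p) for p in props]
    freqs = aac(seq)
    thetas = []
    for j in range(1, lam + 1):
        # j-th tier correlation factor: average coupling between residues j apart.
        total = 0.0
        for i in range(len(seq) - j):
            total += sum((t[seq[i]] - t[seq[i + j]]) ** 2 for t in tables) / len(tables)
        thetas.append(total / (len(seq) - j))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


vector = pseaac("ACDEFGHIKLMNPQRSTVWYAC", lam=3)
print(len(vector), round(sum(vector), 6))
```

With `lam=0` the vector reduces to the plain AAC, which mirrors the statement that the first 20 components carry the conventional composition while the extra components carry sequence-order information.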
Functional domain mode
The functional domain (FunD) is the core of a protein. Therefore, in determining the 3-D (dimensional) structure of a protein by experiments (see, e.g., Call et al., 2010, Pielak and Chou, 2010, Schnell and Chou, 2008, Wang et al., 2009) or by computational modeling (see, e.g., Chou, 2004a, Chou, 2004b), the first priority was always its FunD.
Using the FunD information to formulate protein samples was originally proposed in Cai et al. (2003) and Chou and Cai (2002) based on the 2,005 FunDs in the SBASE-A database (Murvai et al., 2001). Since then, a series of new protein FunD databases were established, such as COG (Tatusov et al., 2003), KOG (Tatusov et al., 2003), SMART (Letunic et al., 2006), Pfam (Finn et al., 2006), and CDD (Marchler-Bauer et al., 2007). Of these databases, CDD contains the domains imported from COG, Pfam, and SMART, and hence is relatively more complete (Marchler-Bauer et al., 2007) and was adopted in most of the recent publications (see, e.g., Chou and Shen, 2010a, Chou and Shen, 2010c, Shen and Chou, 2009d). Version 2.11 of CDD contains 17,402 characteristic domains. Thus, when using the general formulation of PseAAC (Eq. (6)) to incorporate the FunD information, we have Ω=17,402, i.e.,
$$\mathbf{P}_{\text{FunD}} = [D_1 \; D_2 \; \cdots \; D_i \; \cdots \; D_{17402}]^{\mathbf{T}} \tag{9}$$
where $\mathbf{T}$ has the same meaning as in Eq. (4), and
$$D_i = \begin{cases} 1, & \text{when a hit is found for } \mathbf{P} \text{ against the } i\text{-th FunD of CDD} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
For the detailed procedure of how to find the hit for $\mathbf{P}$ in CDD, refer to Chou and Shen (2010a).
Similar approaches of representing protein samples with the FunD mode were also used for predicting protein subcellular localization (Chou and Cai, 2002, Chou and Cai, 2004d), membrane protein types (Cai and Chou, 2006, Cai et al., 2003), enzyme functional classes (Shen and Chou, 2007a), protease types (Chou and Shen, 2008a, Shen and Chou, 2009c), GPCR types (Xiao et al., 2010b), protein structural class (Chou and Cai, 2004b), protein fold pattern (Shen and Chou, 2009a), and protein quaternary structural attributes (Shen and Chou, 2009b, Xiao et al., 2009a, Xiao et al., 2010a).
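The binary encoding used by the FunD mode can be sketched in a few lines of Python. The sketch assumes that the domain search has already been run and has returned the 1-based indices of the domains that were hit; how those hits are obtained from CDD is outside its scope, and the names `hit_vector` and `is_naught` are hypothetical.

```python
from typing import Iterable, List

FUND_DIMENSION = 17402  # number of characteristic domains quoted for CDD version 2.11


def hit_vector(hit_indices: Iterable[int], dimension: int = FUND_DIMENSION) -> List[int]:
    """Build a 0/1 vector with a 1 at every (1-based) domain index that was hit."""
    vector = [0] * dimension
    for index in hit_indices:
        if not 1 <= index <= dimension:
            raise ValueError(f"domain index {index} outside 1..{dimension}")
        vector[index - 1] = 1
    return vector


def is_naught(vector: List[int]) -> bool:
    """True when no domain was hit, i.e. the protein has no usable FunD description."""
    return not any(vector)


# Hypothetical search result: the query hits domains number 3 and 17402.
p_fund = hit_vector({3, 17402})
print(sum(p_fund), is_naught(p_fund))
```

The same encoding applies to any other binary "hit or no hit" representation by changing `dimension`. Because such vectors are extremely sparse, a production implementation would more likely store only the set of hit indices.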
Gene ontology mode
The Gene Ontology (GO) database (Ashburner et al., 2000) was established according to three criteria: molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way that better reflects some of their important attributes, such as subcellular localization and biological function (Chou and Shen, 2007c, Chou and Shen, 2008b).
The GO database (version 70.0 released 10 March 2008) contains 60,020 GO numbers. Thus, when using the general formulation of PseAAC to incorporate the GO information, we have Ω=60,020, i.e.,
$$\mathbf{P}_{\text{GO}} = [G_1 \; G_2 \; \cdots \; G_i \; \cdots \; G_{60020}]^{\mathbf{T}} \tag{11}$$
where
$$G_i = \begin{cases} 1, & \text{when a hit is found for } \mathbf{P} \text{ against the } i\text{-th GO number} \\ 0, & \text{otherwise} \end{cases} \tag{12}$$
For the detailed procedure of how to find the hit for $\mathbf{P}$ in the GO database, refer to Chou and Shen (2010a).
The information extracted from the GO database (Ashburner et al., 2000, Camon et al., 2004, Harris et al., 2004) was used to formulate PseAAC for predicting protein subcellular localization (Cai and Chou, 2003, Chou and Cai, 2003b, Chou and Cai, 2004d, Chou and Shen, 2006a, Chou and Shen, 2006b, Chou and Shen, 2006c, Chou and Shen, 2007a, Chou and Shen, 2007b, Chou and Shen, 2007c, Chou and Shen, 2008b, Lee et al., 2005, Shen and Chou, 2007b, Shen and Chou, 2007c, Shen and Chou, 2007d, Shen et al., 2007), enzyme functional class (Chou and Cai, 2004a, Chou and Cai, 2004c), membrane protein types (Chou and Cai, 2005), protease types (Zhou and Cai, 2006), and protein–protein interactions (Chou and Cai, 2006).
Sequential evolution mode
Biology is a natural science with a historic dimension. All biological species have developed continuously, starting out from a very limited number of ancestral species. This is true for protein sequences as well (Chou, 2004b). Their evolution involves changes of single residues, insertions and deletions of several residues (Chou, 1995b), gene doubling, and gene fusion. With these changes accumulated over a long period of time, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing in the same subcellular location.
The general formulation of PseAAC can be used to incorporate this kind of information via its sequential evolution mode, as formulated by Eqs. (13) and (14), in which λ is an uncertain number that will be further discussed later, L is the length of P (counted in the total number of its constituent amino acids), and $E_{i \to j}$ represents the score of the amino acid residue in the i-th position of the protein sequence being changed to amino acid type j during the evolutionary process (Schaffer et al., 2001), which can be derived by using PSI-BLAST (Schaffer et al., 2001) to search the Swiss-Prot database as described in Chou and Shen (2010c). Here, the numerical codes 1, 2,…,20 are used to denote the 20 native amino acid types according to the alphabetical order of their single character codes.
The above equations were used to identify membrane proteins and their types (Chou and Shen, 2007d), enzymes and their functional classes (Shen and Chou, 2007a), proteases and their types (Chou and Shen, 2008a), protein quaternary structural attributes (Shen and Chou, 2009b), as well as protein subcellular localization (Chou and Shen, 2010a, Chou and Shen, 2010b).
Besides the aforementioned PseAAC modes, there may be other feature extraction methods to represent protein samples, but they can always be formulated in the form of Eq. (6), the general formulation of PseAAC.
It is instructive to point out that, regardless of which kind of PseAAC mode is adopted for protein samples, the query proteins and the proteins used to train the prediction engine must be defined in the same infrastructural frame with exactly the same dimension. For instance, if a query protein is defined in the 17402-D FunD space (see Eq. (9)), then the prediction should be carried out based on those proteins in the training set that can be defined in exactly the same 17402-D FunD space as well. If a query protein is defined in the 60020-D GO space (see Eq. (11)), then the prediction should be carried out based on those proteins in the training set that can be defined in exactly the same 60020-D GO space as well. If the query protein is a naught vector in both the 17402-D FunD space and the 60020-D GO space, and hence must be defined instead in the sequential evolution space (see Eq. (13)), then all the proteins used to train the prediction engine must also be formulated in the same sequential evolution space. It is particularly important to follow such a self-consistency principle when hybridizing different PseAAC modes or building an ensemble classifier by fusing many individual classifiers (Chou and Shen, 2006d).
Prediction algorithm (operating engine)
The problem of predicting protein attributes can be generally described as follows. Suppose a system contains N proteins ($\mathbf{P}_1, \mathbf{P}_2, \ldots, \mathbf{P}_N$), which have been classified into M subsets (categories) as formulated by Eq. (1), where each subset $\mathbb{S}_m$ (m=1,2,…,M) is composed of proteins with the same attribute category and its size (the number of proteins therein) is $N_m$. Obviously, we have $N = N_1 + N_2 + \cdots + N_M$. According to Eq. (6), we can suppose without loss of generality that the k-th protein in the subset $\mathbb{S}_m$ (see Eq. (1)) is expressed by
$$\mathbf{P}_k^m = [\psi_{k,1}^m \; \psi_{k,2}^m \; \cdots \; \psi_{k,j}^m \; \cdots \; \psi_{k,\Omega}^m]^{\mathbf{T}} \tag{15}$$
where $\psi_{k,j}^m$ is the j-th component of the k-th protein in $\mathbb{S}_m$. Now, for a query protein $\mathbf{P}$ as defined by Eq. (6), how can we identify which subset it belongs to?
Many different prediction algorithms have been introduced to address this problem, such as the discriminant algorithm (Chou and Maggiora, 1998, Chou and Elrod, 1999), neural network algorithm (Cai et al., 2000, Cai et al., 2001), support vector machine (SVM) (Cai et al., 2003, Cai et al., 2004, Chou and Cai, 2002), and K-nearest neighbor algorithm (Cai and Chou, 2003, Chou and Shen, 2006b). In this paper we shall focus on the K-nearest neighbor algorithm (Denoeux, 1995) and show how to generate a powerful ensemble classifier by fusing many individual basic classifiers characterized by different control parameters.
The K-nearest neighbor (KNN) classifier is quite popular in the pattern recognition community owing to its good performance and ease of use. According to the KNN rule (Denoeux, 1995, Keller et al., 1985), also named the “voting KNN rule”, the query protein should be assigned to the subset represented by a majority of its K nearest neighbors, as illustrated in Fig. 5.
Fig. 5
Illustration to show how the KNN classifier depends on the selection of parameter K in identifying the attribute category of a query protein, where the query protein P is represented by the character q with a filled circle, proteins belonging to subset $\mathbb{S}_1$ (category 1) are represented by the open circle with number 1, proteins of $\mathbb{S}_2$ by the open circle with number 2, and so forth. When K=1, the query protein is predicted to belong to category 2, as its nearest protein does; when K=3, the query protein is predicted to belong to category 3 because two of its three nearest proteins belong to that category; when K=9, the query protein is predicted to belong to category 2 again because the majority of its nine nearest proteins belong to category 2.
There are many different definitions to measure the “nearness” for the KNN classifier, such as Euclidean distance, Hamming distance (Mardia et al., 1979), and Mahalanobis distance (Chou, 1995a, Mahalanobis, 1936, Pillai, 1985). Usually, the following equation was adopted to measure the nearness between proteins $\mathbf{P}$ and $\mathbf{P}_k^m$ (cf. Eqs. (6), (15)):
$$D(\mathbf{P}, \mathbf{P}_k^m) = 1 - \frac{\mathbf{P} \cdot \mathbf{P}_k^m}{\|\mathbf{P}\| \, \|\mathbf{P}_k^m\|} \tag{16}$$
where $\mathbf{P} \cdot \mathbf{P}_k^m$ is the dot product of the two vectors, and $\|\mathbf{P}\|$ and $\|\mathbf{P}_k^m\|$ their moduli, respectively. According to Eq. (16), when $\mathbf{P} \equiv \mathbf{P}_k^m$ we have $D(\mathbf{P}, \mathbf{P}_k^m) = 0$, indicating that the “distance” between these two proteins is zero and hence they have perfect or 100% similarity. In using the KNN rule, the predicted result will depend on the selection of the parameter K, the number of the nearest neighbors to the query protein P, as described below.
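The nearness measure of Eq. (16), one minus the cosine of the angle between the two feature vectors, can be sketched in Python as follows. The function name `distance` is hypothetical, and rejecting zero-length vectors is an assumption of this sketch, since the measure is undefined for a naught vector.

```python
import math
from typing import Sequence


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Return 1 - (p . q) / (|p| |q|) for two vectors of equal dimension."""
    if len(p) != len(q):
        raise ValueError("both proteins must be defined in the same space")
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0.0:
        raise ValueError("distance is undefined for a zero vector")
    return 1.0 - dot / norm


print(distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # identical vectors
print(distance([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Because only the angle matters, two vectors that differ by a positive scale factor are at zero distance from each other under this measure.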
Nearest neighbor classifier
The nearest neighbor classifier (Cover and Hart, 1967), also called the NN classifier, is a special case of the KNN classifier with K=1 (Fig. 5). With the NN classifier, the protein $\mathbf{P}$ will be predicted to belong to the same attribute category as the protein in the learning dataset that has the shortest “distance” to $\mathbf{P}$, i.e., the query protein will be classified in the μ-th attribute category if
$$\mu = \arg\min_{m} \left\{ \min_{k} D(\mathbf{P}, \mathbf{P}_k^m) \right\}, \quad (m = 1, 2, \ldots, M; \; k = 1, 2, \ldots, N_m) \tag{17}$$
where $\min_{k} D(\mathbf{P}, \mathbf{P}_k^m)$ means taking the minimum value of $D(\mathbf{P}, \mathbf{P}_k^m)$ for the proteins in the subset $\mathbb{S}_m$ (cf. Eqs. (1) and (16)), and the operator arg min means taking the argument of m that minimizes the quantity right after the operator. In other words, μ in Eq. (17) is equal to the argument of m that minimizes $\min_{k} D(\mathbf{P}, \mathbf{P}_k^m)$. If two or more arguments lead to the same minimum value, the query protein will be randomly assigned to one of the subsets associated with these arguments, although this kind of tie rarely happens. Owing to its simplicity and apparent efficiency, the NN classifier is still a favorite method used by many investigators (see, e.g., Chen et al., 2010, He et al., 2010, Huang et al., 2010).
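A minimal Python sketch of the NN rule is given below. It assumes that each training protein is supplied as a `(vector, label)` pair, that nearness is the cosine-based distance of Eq. (16), and that ties are broken at random as described above; the names `distance` and `nn_predict` are hypothetical.

```python
import math
import random
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], Hashable]


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine-based distance: 1 - (p . q) / (|p| |q|)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def nn_predict(query: Sequence[float], training: List[Sample], rng=random):
    """Return the category of the training protein closest to the query."""
    scored = [(distance(query, vec), label) for vec, label in training]
    best = min(d for d, _ in scored)
    # Collect every category that attains the minimum distance (tie case).
    tied = sorted({label for d, label in scored if d == best})
    return rng.choice(tied)


training = [((1.0, 0.0), "alpha"), ((0.0, 1.0), "beta"), ((0.7, 0.7), "beta")]
print(nn_predict((0.9, 0.1), training))
```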
KNN classifier
With the KNN classifier when K>1, the attribute of the query protein P is determined by a majority vote among its K nearest neighbors (Fig. 5), which can be formulated as follows. Suppose $\mathbf{P}_1^{*}, \mathbf{P}_2^{*}, \ldots, \mathbf{P}_K^{*}$ are the K proteins in $\mathbb{S}$ that have the closest distances to P; the query protein is then predicted to belong to the μ-th subset (attribute category) if

$$\mu = \arg\max_{m} \left\{ \sum_{i=1}^{K} \Delta(\mathbf{P}_i^{*}, \mathbb{S}_m) \right\}, \quad m = 1, 2, \ldots, M \tag{18}$$

where μ is the argument of m that maximizes $\sum_{i=1}^{K} \Delta(\mathbf{P}_i^{*}, \mathbb{S}_m)$ and

$$\Delta(\mathbf{P}_i^{*}, \mathbb{S}_m) = \begin{cases} 1, & \text{if } \mathbf{P}_i^{*} \in \mathbb{S}_m \\ 0, & \text{otherwise} \end{cases} \tag{19}$$

where ∈ is a symbol in set theory meaning “member of”. If the vote results in a tie, the query protein is randomly assigned to one of the subsets involved in the tie. Generally speaking, the greater the K (the number of nearest neighbors counted), the less likely a tie is to occur.

As mentioned above, the sequential evolution PseAAC mode of Eq. (13) contains a parameter λ, which determines how many tiers of sequence correlation are taken into account in the PseAAC. As we can see from Eq. (14), the only constraint on λ is that it must be smaller than L, the number of amino acids in the protein concerned. Suppose the length of the shortest protein investigated is 50; then λ can be any of the following 50 numbers: 0, 1, 2,…,49. Although in principle we can include all these possibilities for λ by enlarging the dimension of the PseAAC to contain 20×50=1000 components, this may cause various problems for statistical prediction, such as “high dimension disaster” and “overfitting redundancy” (Wang et al., 2008a). Indeed, if the PseAAC contains too many trivial components, it may reduce the cluster-tolerant capacity (Chou, 1999) and lower the success rate of cross-validation. Accordingly, for a given training dataset, there is an optimal value for λ. However, it would be time-consuming and tedious to find the optimal λ by changing its value and running tests one by one.

Likewise, the KNN classifier (cf. Eq. (18)) also contains a parameter K, the number of nearest neighbors to a query protein (Fig. 5). Choosing a different value for K will affect the predicted result. In other words, for a given training dataset, there is an optimal value for K as well.

Parameters such as λ and K are called uncertain parameters. The number of uncertain parameters depends on which model is used to represent the protein samples and which classifier is used as the prediction engine. It can be seen from Eqs. (9), (11), (13), and (18) that one uncertain parameter, K, needs to be determined if the KNN classifier is used with the FunD (or GO) mode of PseAAC, and that two uncertain parameters, K and λ, need to be determined if the KNN classifier is used with the sequential evolution mode. Determining the optimal values for two uncertain parameters would be even more tedious and time-consuming. To deal with this kind of uncertain parameter, let us introduce the fusion approach.
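The majority vote of Eqs. (18) and (19) can be sketched in a few lines of Python. This is only an illustration: Euclidean distance is assumed in place of the distance of Eq. (16), and the function name `knn_predict` and the toy vectors are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(query, samples, k, rng=None):
    """Majority vote among the k nearest neighbours of `query`.

    `samples` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption made for this sketch.
    """
    rng = rng or random.Random(0)
    ranked = sorted(samples, key=lambda s: math.dist(query, s[0]))
    # Each of the k nearest neighbours casts one vote for its own subset.
    votes = Counter(label for _, label in ranked[:k])
    top = max(votes.values())
    # A tie in the vote is broken at random, as described in the text.
    winners = sorted(m for m, v in votes.items() if v == top)
    return rng.choice(winners)


# Toy 2-D vectors standing in for PseAAC vectors (illustrative only).
samples = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.3, 0.1), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"),
]
for k in (1, 3):
    print(k, knn_predict((0.8, 0.9), samples, k))
```

Running the same query with several values of K is a quick way to see how the choice of K can change the predicted category on less cleanly separated data.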
One-dimensional fusion
In most applications of the KNN classifier to protein attribute prediction, the success rate decreases remarkably when K>20. Therefore, the basic individual classifiers to be considered can be generally expressed as

$$\mathbb{C}(K) \triangleright \mathbf{P}, \quad K = 1, 2, \ldots, 20 \tag{20}$$

where $\mathbb{C}(K)$ represents the KNN classifier that is a function of K, and the symbol $\triangleright$ is the identification operator, meaning that $\mathbb{C}(K)$ is used to identify the attribute of the query protein P among the M subsets of $\mathbb{S}$ in Eq. (1). Suppose the accumulated score thus obtained (with K=1,2,…,20) for the protein P belonging to the m-th subset is given by

$$Y_m = \sum_{K=1}^{20} \delta(K, m), \quad m = 1, 2, \ldots, M \tag{21}$$

where

$$\delta(K, m) = \begin{cases} 1, & \text{if } \mathbb{C}(K) \text{ assigns } \mathbf{P} \text{ to } \mathbb{S}_m \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

Thus the query protein P is predicted to belong to the subset for which its score of Eq. (21) is the highest; i.e., P is identified as belonging to the μ-th subset if

$$\mu = \arg\max_{m} \{ Y_m \}, \quad m = 1, 2, \ldots, M \tag{23}$$

where μ is the argument of m that maximizes the score function of Eq. (21). If two or more arguments lead to the same maximum value, the query protein is randomly assigned to one of the subsets associated with these arguments, although this kind of tie rarely happens.
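A minimal sketch of the one-dimensional fusion follows: one KNN classifier is run for each K from 1 to 20, each casts a single vote, and the subset with the highest accumulated score of Eq. (21) wins. Euclidean distance, equal weights for all 20 classifiers, the function names, and the toy data are all assumptions of this sketch rather than details confirmed by the text.

```python
import math
import random
from collections import Counter


def knn_predict(query, samples, k, rng):
    """Majority vote among the k nearest (vector, label) samples."""
    ranked = sorted(samples, key=lambda s: math.dist(query, s[0]))
    votes = Counter(label for _, label in ranked[:k])
    top = max(votes.values())
    return rng.choice(sorted(m for m, v in votes.items() if v == top))


def fuse_over_k(query, samples, k_values=range(1, 21), rng=None):
    """Accumulate one vote per KNN classifier and return (winner, scores)."""
    rng = rng or random.Random(0)
    scores = Counter()
    for k in k_values:
        if k > len(samples):
            continue  # not enough training samples for this K
        scores[knn_predict(query, samples, k, rng)] += 1
    top = max(scores.values())
    winners = sorted(m for m, v in scores.items() if v == top)
    return rng.choice(winners), dict(scores)


# Two well-separated toy clusters of 15 samples each (illustrative only).
cluster_a = [((0.05 * i, 0.03 * i), "A") for i in range(15)]
cluster_b = [((1 + 0.05 * i, 1 + 0.03 * i), "B") for i in range(15)]
label, scores = fuse_over_k((0.1, 0.1), cluster_a + cluster_b)
print(label, scores)
```

The point of the design is that no single K has to be tuned: every K in the chosen range contributes, and the final call depends only on the accumulated score.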
Two-dimensional fusion
When the KNN classifier is operated on a query protein formulated with the sequential evolution mode (cf. Eq. (13)), we face a problem with two uncertain parameters, K and λ. In general, the shortest protein sequence investigated is 50 amino acids long (Chou and Shen, 2008a, Chou and Shen, 2010c); hence the maximum value allowed for λ can be set to 49. Thus, the basic individual classifiers to be considered become

$$\mathbb{C}(K, \lambda) \triangleright \mathbf{P}, \quad K = 1, 2, \ldots, 20; \; \lambda = 0, 1, \ldots, 49 \tag{24}$$

and the corresponding accumulated score for the query protein belonging to the m-th subset is given by

$$Y_m = \sum_{K=1}^{20} \sum_{\lambda=0}^{49} \delta(K, \lambda; m), \quad m = 1, 2, \ldots, M \tag{25}$$

where

$$\delta(K, \lambda; m) = \begin{cases} 1, & \text{if } \mathbb{C}(K, \lambda) \text{ assigns } \mathbf{P} \text{ to } \mathbb{S}_m \\ 0, & \text{otherwise} \end{cases} \tag{26}$$

and the query protein is predicted to belong to the subset for which its score of Eq. (25) is the highest; i.e., the query protein P is identified as belonging to the μ-th subset if

$$\mu = \arg\max_{m} \{ Y_m \}, \quad m = 1, 2, \ldots, M \tag{27}$$

where μ is the argument of m that maximizes the score function of Eq. (25). If two or more arguments lead to the same maximum value, the query protein is randomly assigned to one of the subsets associated with these arguments, although this kind of tie rarely happens.

If a basic individual classifier involves three or more uncertain parameters, fusion in three or more dimensions can be performed by following procedures similar to those described above.
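The two-dimensional fusion can be sketched in the same way, with one vote cast for every (K, λ) pair. The featurizer `toy_pseaac` below is a hypothetical stand-in: it follows the classical PseAAC layout of 20 composition terms plus λ sequence-order correlation terms, but it uses an arbitrary per-residue property and is not the sequential evolution mode of Eq. (13). Euclidean distance, equal weights, the weight factor 0.05, and the toy sequences are likewise assumptions.

```python
import math
import random
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"
# Arbitrary toy property per residue (hypothetical, not a real scale).
PROP = {aa: i / 19 for i, aa in enumerate(AMINO)}


def toy_pseaac(seq, lam, weight=0.05):
    """20 composition terms followed by `lam` correlation terms (lam < len(seq))."""
    freqs = [seq.count(aa) / len(seq) for aa in AMINO]
    thetas = [
        sum((PROP[seq[i]] - PROP[seq[i + j]]) ** 2 for i in range(len(seq) - j))
        / (len(seq) - j)
        for j in range(1, lam + 1)
    ]
    norm = sum(freqs) + weight * sum(thetas)
    return [f / norm for f in freqs] + [weight * t / norm for t in thetas]


def knn_vote(query_vec, vectors, labels, k, rng):
    """Majority vote among the k training vectors nearest to `query_vec`."""
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(query_vec, vectors[i]))
    votes = Counter(labels[i] for i in order[:k])
    top = max(votes.values())
    return rng.choice(sorted(m for m, v in votes.items() if v == top))


def fuse_over_k_and_lambda(query_seq, train, k_values=range(1, 21),
                           lam_values=range(0, 50), rng=None):
    """`train` is a list of (sequence, label); returns (winner, scores)."""
    rng = rng or random.Random(0)
    seqs, labels = zip(*train)
    shortest = min(len(s) for s in seqs + (query_seq,))
    scores = Counter()
    for lam in lam_values:
        if lam >= shortest:
            continue  # lambda must be smaller than the shortest sequence
        vectors = [toy_pseaac(s, lam) for s in seqs]
        query_vec = toy_pseaac(query_seq, lam)
        for k in k_values:
            if k > len(vectors):
                continue  # not enough training samples for this K
            scores[knn_vote(query_vec, vectors, labels, k, rng)] += 1
    top = max(scores.values())
    winners = sorted(m for m, v in scores.items() if v == top)
    return rng.choice(winners), dict(scores)


# Toy sequences of length 10, so lambda runs over 0..9 (illustrative only).
train = [
    ("AAACAAACAA", "X"), ("ACACAAAACA", "X"), ("AACAACAAAC", "X"),
    ("WYWYWWYWYY", "Z"), ("YYWYWWYYWW", "Z"), ("WWYWYYWYWY", "Z"),
]
label, scores = fuse_over_k_and_lambda("ACAAACAACA", train, k_values=range(1, 4))
print(label, scores)
```

Each value of λ changes the dimension of the feature vectors, so the training set is re-encoded once per λ and then reused for every K.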
Cross-validation test
After a prediction method has been developed, a natural question to ask is: what is its accuracy?

In statistical prediction, it is meaningless to state the success rate of a predictor without specifying which cross-validation method and benchmark dataset were used to test its accuracy. In the literature, the following three cross-validation methods are generally used to examine the effectiveness of a statistical prediction method: (1) the independent dataset test, (2) the subsampling (Γ-fold, such as 5- or 10-fold cross-validation) test, and (3) the jackknife test (Chou and Zhang, 1995).

For the independent dataset test, all the proteins used to test the predictor are outside the training dataset used to train it, so as to exclude the “memory” effect or bias. However, the selection of the independent proteins can be quite arbitrary unless their number is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor that achieves a higher success rate than another predictor on a given independent testing dataset might fail to do so when tested on another independent testing dataset (Chou and Zhang, 1995). Accordingly, the independent dataset test is not a fairly objective test method, although it has often been used to demonstrate the practical application of a predictor (see, e.g., Cedano et al., 1997, Chou and Elrod, 1999, Chou and Shen, 2006c, Chou and Shen, 2007a).

For the subsampling test, the concrete procedure usually used in the literature is 5-fold, 7-fold, or 10-fold cross-validation. The problem with the Γ-fold cross-validation test is that the number of possible selections in dividing a benchmark dataset is astronomical even for a very simple dataset. This is because, for a benchmark dataset as formulated in Eq. (1), the number of possible combinations of taking one Γ-th (i.e., 1/Γ) of the proteins from each of the subsets in Eq. (1) will be

$$\Pi = \prod_{m=1}^{M} \Pi_m \tag{28}$$

where

$$\Pi_m = \frac{N_m!}{\left( N_m - \mathrm{Int}[N_m/\Gamma] \right)! \; \mathrm{Int}[N_m/\Gamma]!} \tag{29}$$

where $N_m$ is the number of proteins in the m-th subset $\mathbb{S}_m$, and the symbol Int is the integer-truncating operator, meaning to take the integer part of the number in the brackets right after it.

For example, without loss of generality, let us consider the case of 5-fold cross-validation (i.e., Γ=5) for a very simple benchmark dataset that contains 250 proteins, of which $N_1=65$ belong to subset $\mathbb{S}_1$, $N_2=60$ to subset $\mathbb{S}_2$, $N_3=55$ to subset $\mathbb{S}_3$, and $N_4=70$ to subset $\mathbb{S}_4$. Substituting these figures into Eqs. (28), (29), we find that the number of possible combinations of taking one-fifth of the proteins from each of the four subsets will be

$$\Pi = \frac{65!}{52!\,13!} \times \frac{60!}{48!\,12!} \times \frac{55!}{44!\,11!} \times \frac{70!}{56!\,14!}$$

indicating that even for such a simple and small benchmark dataset, the number of possible combinations of taking one-fifth of the proteins from each of the four subsets for 5-fold cross-validation is astronomical.

Now let us consider a moderate-size dataset that consists of 640 proteins classified into M=8 subsets, each containing 80 proteins, i.e., $N_1=N_2=\cdots=N_8=80$. According to Eqs. (28), (29), the number of possible combinations of taking one-fifth of the proteins from each of the 8 subsets for 5-fold cross-validation will be

$$\Pi = \left( \frac{80!}{64!\,16!} \right)^{8}$$

If the above benchmark dataset is slightly larger and more complicated, i.e., the number of proteins is increased from 640 to 800 and the number of subsets from 8 to 10, with each still containing 80 proteins, then the number of possible combinations of taking one-fifth of the proteins from each of the 10 subsets for 5-fold cross-validation will be

$$\Pi = \left( \frac{80!}{64!\,16!} \right)^{10}$$

In fact, many typical benchmark datasets contain more than 1000 proteins (see, e.g., Chou and Shen, 2008a, Chou and Shen, 2010a, Chou and Shen, 2010c). Therefore, in any actual subsampling cross-validation test, only an extremely small fraction of the possible selections is taken into account. Since different selections generally lead to different results even for the same benchmark dataset and the same predictor, the subsampling test (such as 5-fold cross-validation) cannot avoid arbitrariness either. A test method unable to yield a unique outcome cannot be deemed an ideal one.

In the jackknife test, each protein in the benchmark dataset is singled out in turn and tested by the predictor trained on the remaining protein samples. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and each protein sample is in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem mentioned above for the independent dataset test and the subsampling test is avoided, because the outcome obtained by jackknife cross-validation is always unique for a given benchmark dataset.
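Returning to the subsampling counts of Eqs. (28) and (29): they are straightforward to evaluate exactly with integer arithmetic. The short Python sketch below does so for the three example datasets discussed above; it reads Eq. (29) as the binomial coefficient of choosing Int[N_m/Γ] proteins out of N_m, and the function name `n_selections` is hypothetical.

```python
from math import comb, prod


def n_selections(subset_sizes, gamma):
    """Number of ways to pick Int[N_m / gamma] proteins from every subset."""
    return prod(comb(n, n // gamma) for n in subset_sizes)


small = n_selections([65, 60, 55, 70], 5)   # 250 proteins, 4 subsets
medium = n_selections([80] * 8, 5)          # 640 proteins, 8 subsets
large = n_selections([80] * 10, 5)          # 800 proteins, 10 subsets

for name, value in [("250 proteins", small),
                    ("640 proteins", medium),
                    ("800 proteins", large)]:
    # Report only the number of decimal digits; the exact integers are huge.
    print(f"{name}: {len(str(value))} decimal digits")
```

Printing the digit counts rather than the values themselves makes the growth from one dataset to the next easy to compare.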
As for the possible overestimation of the success rate by the jackknife test because only one sample is singled out at a time for testing, the answer is that, as long as the jackknife test is performed on a stringent benchmark dataset in which none of the proteins has ≥25% pairwise sequence identity to any other in the same subset, such as those mentioned in Section 2, it is highly unlikely to yield an overestimated rate compared with the actual success rate in practical applications, as demonstrated in Chou and Shen (2010c) and Shen and Chou (2010). Besides, when the jackknife test is used to compare two predictors, even if there is some overestimation due to the use of a less stringent benchmark dataset for one predictor, the same overestimation would exist for the other as long as both are tested on the same dataset.

Accordingly, the jackknife test has been increasingly and widely used by investigators to examine the quality of various predictors (see, e.g., Anand and Suganthan, 2009, Cai et al., 2010, Chen et al., 2008a, Chen et al., 2008b, Chen and Han, 2009, Du and Li, 2008, Du et al., 2009, Fang et al., 2008, Feng and Luo, 2008, Gu and Chen, 2009, Gu et al., 2010a, Jahandideh et al., 2007a, Jahandideh et al., 2007b, Jahandideh et al., 2009, Ji et al., 2010, Kannan et al., 2008, Li et al., 2009, Lin, 2008, Lin et al., 2009a, Liu et al., 2010a, Munteanu et al., 2008, Nanni and Lumini, 2008, Nanni and Lumini, 2009, Rezaei et al., 2008, Shao et al., 2009, Shi et al., 2008, Shi and Hu, 2010, Vilar et al., 2009, Wang and Yang, 2010, Wang et al., 2010a, Wang et al., 2008b, Yang and Jiang, 2010, Yang et al., 2009, Yang et al., 2010, Zhao et al., 2008, Zhou et al., 2008).

However, even when the jackknife approach is used for cross-validation, the same predictor may still generate obviously different success rates when tested on different benchmark datasets.
This is because the more stringent a benchmark dataset is in excluding homologous and highly similar sequences, the more difficult it is for a predictor to achieve a high overall success rate (Chou and Shen, 2010a). Also, the more subsets (attribute categories) a benchmark dataset covers, the more difficult it is to achieve a high overall success rate. This can be easily seen from the following consideration. Suppose a benchmark dataset consists of two subsets (attribute categories), each containing the same number of proteins. The overall success rate in identifying their attribute categories by random assignment would be 1/2=50%. However, for a benchmark dataset consisting of 20 subsets, the corresponding overall success rate by random assignment would be 1/20=5%, which is only one-tenth of the former.
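The jackknife procedure described in this section can be sketched as a leave-one-out loop: each sample is removed in turn, the predictor is rebuilt from the remaining samples, and the removed sample is used as the query. The Python sketch below pairs the loop with a 1-NN predictor and Euclidean distance, both of which are assumptions of this illustration; the function names and toy data are hypothetical.

```python
import math


def nn_label(query, samples):
    """Label of the training sample nearest to `query` (1-NN, Euclidean)."""
    return min(samples, key=lambda s: math.dist(query, s[0]))[1]


def jackknife_success_rate(samples, predict=nn_label):
    """Fraction of samples predicted correctly when each is left out in turn."""
    hits = 0
    for i, (vector, label) in enumerate(samples):
        # Train on everything except the i-th sample, then test on it.
        rest = samples[:i] + samples[i + 1:]
        hits += predict(vector, rest) == label
    return hits / len(samples)


# Toy data: two small clusters plus one category with a single member.
data = [
    ((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
    ((1.0, 1.0), "B"), ((1.1, 1.0), "B"),
    ((5.0, 5.0), "C"),
]
print(jackknife_success_rate(data))
```

Because every sample is left out exactly once, repeated runs on the same dataset with a deterministic predictor return the same rate, which is the uniqueness property emphasized above. The lone member of category "C" can never be predicted correctly here, since no other sample of its category remains once it is removed.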
Web-server
Even if a powerful predictor has been developed by accomplishing the above four procedures, namely constructing a valid benchmark dataset, formulating protein samples with PseAAC to catch their essential and core features, introducing a powerful and efficient algorithm or engine to operate the prediction, and achieving a high overall success rate by the jackknife test on a stringent dataset in which none of the proteins has ≥25% pairwise sequence identity to any other in the same subset (attribute category), it does not mean that the predictor is really complete. This is because we are living in the Internet Age. To make a new prediction method really useful for the majority of people, it is an important and necessary step to provide a user-friendly and publicly accessible web-server for the method (Chou and Shen, 2009). Technically speaking, a web-server is a computer program that is responsible for accepting Hypertext Transfer Protocol (HTTP) requests from clients. By means of web-servers, many computational prediction methods, regardless of how difficult their mathematics or how complicated their algorithms are, can be easily used by the vast majority of scientists to generate their desired data without the need to understand the mathematical details.
Conclusion and perspectives
In order to make timely use of the huge number of newly discovered protein sequences generated in the postgenomic era for basic research and drug development, scientists are anxious to know their biological attributes. Many studies from various research laboratories around the world have indicated that mathematical analysis, computational modeling, and the introduction of novel physical concepts into biology and medicine, such as graphical analysis (Andraos, 2008, Myers and Palmer, 1985, Zhou and Deng, 1984), modeling three-dimensional structures of targeted proteins/peptides for drug design (Sharma et al., 2008, Zhou and Troy, 2003, Zhou and Troy, 2005a, Zhou and Troy, 2005b, Zhou et al., 2004), diffusion-controlled reaction simulation (Zhou et al., 1981, Zhou and Zhong, 1982, Zhou et al., 1983), cellular responding kinetics (Qi et al., 2007), and biological functions of solitons in DNA (Zhou, 1989), can provide useful insights for both basic research and drug design and hence are widely welcomed by the scientific community. In view of this, it is highly desirable to develop automated methods by introducing new concepts and approaches for rapidly and accurately predicting the attributes of uncharacterized proteins based on their sequence information alone. During the past two decades or so, many statistical methods for predicting various protein attributes have been proposed. In this review, the key steps for establishing a powerful predictor in this regard have been analyzed in the hope that the points raised here may help stimulate the further development of new and more powerful predictors in this area. It is anticipated that the general form of PseAAC as formulated in this review may further stimulate efforts to find various new modes of optimal PseAAC, which is one of the most important future directions to focus on in order to substantially improve the power of predicting protein attributes.