Literature DB >> 15585667

Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach.

L Y Han¹, C Z Cai, Z L Ji, Z W Cao, J Cui, Y Z Chen.

Abstract

The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign functional family of a protein to provide useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), for predicting protein functional family directly from sequence irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. But its capability for assignment of distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 50 enzymes that have no homolog of known function from PSI-BLAST search of protein databases. The other contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72% of the enzymes in the first group and 62% of the enzyme pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

Mesh：

Substances：
Enzymes

Year: 2004 PMID： 15585667 PMCID： PMC535691 DOI： 10.1093/nar/gkh984

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein functional assignment has been conducted primarily by sequence similarity, clustering and pattern identification methods (1–7). These methods tend to become less effective for novel proteins that have no homolog or whose homolog is of different function (4,5,7,8). Genomes are known to contain a substantial portion of such novel proteins. For instance, 20–100% of the unknown putative protein-coding open reading frames in a number of recently sequenced viral genomes (9–12) are without a single homolog in Swiss-Prot database (13) based on PSI-BLAST search of that database as of September 2004. Hence, there is a need for exploring other functional prediction methods (14,15). Alternative approaches have been developed that explore structural features (16,17), interaction profiles (18,19), protein/gene fusion data (20,21) and functional family assignment by using statistical learning methods including discretized naïve Bayes, C4.5 decision trees, and instance-based leaning (22), neural networks (23) and support vector machines (SVMs) (22,24–29). In particular, the possibility of using SVM for functional family assignment of distantly related proteins and homologous proteins of different functions has been raised based on testing results of a relatively small number of such proteins (25,27). However, the proteins used in these studies were selected based on BLAST instead of PSI-BLAST results. PSI-BLAST (30) is known to be significantly more sensitive to proteins of weak similarities than BLAST (1). Therefore, proteins selected based on PSI-BLAST results can, in a more critical manner, better test the capability of SVM functional classification of distantly related proteins, particularly those whose function cannot be assigned by sequence alignment and clustering methods. Moreover, the number of proteins used in earlier studies is relatively small, which may not be sufficient for testing the performance of SVM assignment of functional family of novel proteins. In this work, two groups of enzymes, obtained from unbiased search of protein databases and literatures and subsequently verified by PSI-BLAST, are used to assess the capability of SVM for predicting the functional family of novel proteins. One group includes enzymes that are without a homolog in the protein databases based on PSI-BLAST search of these databases. A similarity E-value threshold of 0.05 is used for homolog searching to ensure maximum exclusion of enzymes that have a homolog. The second group contains pairs of homologous enzymes of different families. A stricter similarity E-value threshold of 10−6 is used for selecting these enzyme pairs to ensure minimum inclusion of non-homologous pairs. In the hypothetical situation that one enzyme in a pair of homologous enzymes of different families is newly discovered and the other is the only known protein of similar sequence, the function of the first enzyme can be incorrectly assigned to that of the second enzyme by using sequence similarity methods. Thus, it is of interest to examine to what extent SVM can be used as an alternative approach for facilitating functional assignment for these enzymes. These two groups of enzymes are further checked to remove those that are in the SVM training sets. SVM is based on the structural risk minimization principle from statistical learning theory (31). For each protein functional family, it constructs a hyperplane either in an input space or a higher-dimensional hyper-space to maximally separate two groups of proteins, one group is composed of members and the other is composed of non-members of that family. Proteins in a training set, represented by their sequence-derived physicochemical properties, are projected onto this hyperspace where members of a family are separated from the non-members by a hyperplane whose parameters are adjusted by using a testing set of proteins. By projecting a new sequence onto the hyperspace, this SVM system can be used to determine whether it is a member of that family based on its location with respect to the hyperplane. SVM classifies proteins into functional families defined from activities and physicochemical properties rather than sequence similarity (19,22,24,25,27,28,32). These families are composed of multiple homolog groups and some distantly related proteins. The accuracy of SVM depends on the diversity of the protein samples, the quality of the representation of protein properties, and the efficiency of the statistical learning algorithm. To a certain extent, no sequence similarity is required per se. Thus SVM is an attractive approach for facilitating the functional assignment of novel proteins.

METHODS

SVM protein functional family assignment system is developed in the following manner. First, every protein sequence is represented by specific feature vector assembled from encoded representations of tabulated residue properties including amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility for each residue in the sequence (19,22,24,25,32–34). Similar types of features have been successfully used for predicting enzyme functional (22) and structural classes (22,32) by using statistical learning methods. Amino acid composition can be straightforwardly computed. Methods for computing each of the other properties can be found from the literature (19,24,25,33,34). For each of these properties, amino acids are divided into three groups such that those in a particular group are regarded to have the same property. For instance, amino acids can be divided into hydrophobic (CVLIMFW), neutral (GASTPHY) and polar (RKEDQN) groups. The groupings of amino acids for each of the properties are given in Table 1. Three descriptors, composition (C), transition (T) and distribution (D), are used to describe global composition of each of the properties. C is the number of amino acids of a particular property (such as hydrophobicity) divided by the total number of amino acids in a protein sequence. T characterizes the percent frequency with which amino acids of a particular property is followed by amino acids of a different property. D measures the chain length within which the first, 25, 50, 75 and 100% of the amino acids of a particular property is located respectively.

Table 1.

Division of amino acids into three different groups for different physicochemical properties

Property	Group 1	Group 2	Group 3
Hydrophobicity
Type	Polar	Neutral	Hydrophobic
Amino acids in the group	RKEDQN	GASTPHY	CVLIMFW
Van der Waals volume
Value	0–2.78	2.95–4.0	4.43–8.08
Amino acids in the group	GASCTPD	NVEQIL	MHKFRYW
Polarity
Value	4.9–6.2	8.0–9.2	10.4–13.0
Amino acids in the group	LIFWCMVY	PATGS	HQRKNED
Polarizability
Value	0–0.108	0.128–0.186	0.219–0.409
Amino acids	GASDT	CPNVEQIL	KMHFRYW

A hypothetical protein sequence AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE, as shown in Figure 1, has 16 alanines (n1 = 16) and 14 glutamic acids (n2 = 14). The compositions for these two amino acids are n1 × 100.00/(n1 + n2) = 53.33 and n2 × 100.00/(n1 + n2) = 46.67, respectively. There are 15 transitions from A to E or from E to A in this sequence and the percent frequency of these transitions is (15/29) × 100.00 = 51.72. The first, 25, 50, 75 and 100% of As are located within the first 1, 5, 12, 20, and 29 residues, respectively. The D descriptor for As is thus 1/30 × 100.00 = 3.33, 5/30 × 100.00 = 16.67, 12/30 × 100.00 = 40.0, 20/30 × 100.00 = 66.67, 29/30 × 100.00 = 96.67. Likewise, the D descriptor for Es is 6.67, 26.67, 60.0, 76.67, 100.0. Overall, the amino acid composition descriptors for this sequence are C = (53.33, 46.67), T = (51.72), and D = (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0), respectively. Descriptors for other properties can be computed by a similar procedure.

Figure 1

The sequence of a hypothetic protein for illustration of derivation of the feature vector of a protein. Sequence index indicates the position of an amino acid in the sequence. The index for each type of amino acids in the sequence (A or E) indicates the position of the first, second, third, … of that type of amino acid (the position of the first, second, third, …, A is at 1, 3, 4, …). A/E transition indicates the position of AE or EA pairs in the sequence.

Overall, there are 21 elements representing these three descriptors: 3 for C, 3 for T and 15 for D (19,25). The feature vector of a protein is constructed by combining the 21 elements of all of these properties and the 20 elements of amino acid composition in sequential order. Table 2 gives the computed descriptors of the human insulin precursor (Swiss-Prot accession no. P01308). The feature vector of a protein is constructed by combining all of the descriptors in sequential order.

Table 2.

Characteristic descriptors of human insulin precursor (Swiss-Prot AC P01308)

Property	Elements of descriptors
Amino acid composition	9.09	5.45	1.82	7.27	2.73	10.91	1.82	1.82	1.82	18.18
	1.82	2.73	5.45	6.36	4.55	4.55	2.73	5.45	1.82	3.64
Hydrophobicity	24.55	38.18	37.27	15.60	16.51	30.28	5.45	40.91	54.55	80.00
	100.0	1.82	21.82	47.27	68.18	98.18	0.91	12.73	37.27	72.37
	99.09
Van der waals volume	40.00	41.82	18.18	29.36	11.01	13.76	1.82	21.82	52.73	71.82
	99.09	2.73	25.45	56.36	78.18	100.0	0.91	15.45	41.82	50.00
	98.18
Polarity	40.91	32.73	26.36	24.77	20.18	13.76	0.91	14.55	38.18	74.55
	99.09	1.82	20.91	49.09	68.18	91.82	5.45	33.64	53.64	79.09
	100.0
Polarizability	29.09	52.73	18.18	31.19	9.17	15.60	1.82	21.82	52.73	68.18
	91.82	2.73	25.45	56.36	79.09	100.0	0.91	15.45	41.82	50.00
	98.18

The feature vector of this protein is constructed by combining all of the descriptors in sequential order.

SVM is then trained by using representative proteins of a particular functional family (positive samples) and those that are outside this family (negative samples). The positive samples of a family include all of the known distinct proteins in that family. Because of the enormous number of proteins, the size of negative samples needs to be restricted to a manageable level by using a minimum set of representative proteins. One way for choosing representative proteins is to select one or a few distinct proteins from each protein domain family. The negative samples of a family can be selected from seed proteins of the 7316 curated protein families (domain-based) in the Pfam database (35) excluding those families that have at least one member belonging to the functional class. Pfam families are constructed on the basis of sequence similarity. The purpose of using Pfam proteins is to ensure that the negative samples are evenly distributed in the protein space. Sequence similarity is not required for selecting positive samples. In this sense, SVMProt is to some extent independent of sequence similarity. The theory of SVM has been described in the literature (19,24,25,33,34). Thus only a brief description is given here. In linearly separable cases, SVM constructs a hyperplane that separates two different groups of feature vectors with a maximum margin. A feature vector is represented by x, with physicochemical descriptors of a protein as its components. The hyperplane is constructed by finding another vector w and a parameter b that minimizes‖w‖ and satisfies the following conditions: where y is the group index, w is a vector normal to the hyperplane, |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin and ‖w‖ is the Euclidean norm of w. After the determination of w and b, a given vector x can be classified by In nonlinearly separable cases, SVM maps feature vectors into a high dimensional feature space using a kernel function K(x, x). An example of a kernel function is the Gaussian kernel, which has been extensively used in a number of protein classification studies (19,24,26,31,33,34,36): The linear SVM procedure is then applied to the feature vectors in this feature space and the decision function for their classification is given by where the coefficients α0 and b are determined by maximizing the following Langrangian expression: under conditions, A positive or negative value from Equation 3 or 5 indicates that the vector x belongs to the positive or negative group, respectively. To further reduce the complexity of parameter selection, hard margin SVM with threshold instead of soft margin SVM with threshold is used in SVMProt. Scoring of SVM classification of proteins has been estimated by a reliability index and its usefulness has been demonstrated by statistical analysis (34). A slightly modified reliability score, R-value, is used in SVMProt: where d is the distance between the position of the vector of a classified protein and the optimal separating hyperplane in the hyperspace. There is a statistical correlation between R-value and expected classification accuracy (probability of correct classification) (34). Thus another quantity, P-value, is introduced to indicate the expected classification accuracy. P-value is derived from the statistical relationship between the R-value and actual classification accuracy based on the analysis of 9932 positive and 45 999 negative samples of proteins (25).

RESULTS AND DISCUSSION

The protein functional family prediction system SVMProt is improved by using training sets of a significantly larger number of proteins than that reported earlier (25,27). The training and testing sets consist of 49 975 representative enzymes from 46 functional families obtained from UniProt version 1.6, and 243 152 non-enzyme representative proteins from 7316 Pfam curated protein families (35). Enzyme functional families are the International Commission (EC) classes (37) up to the second level (from EC1.1 to EC6.5). The procedure for selecting positive samples of a family is as follows. First, all members of this family in UniProt 1.6 are collected and subsequently mapped into the original feature space which is divided into small grid blocks, then one or a few distinct enzymes are selected from those distributed in each of these blocks are selected as the training set of that family. Enzymes in the testing and independent sets are randomly selected from the remaining pool of family members. The negative samples of a family are selected from representative proteins of Pfam families that are non-enzymes or enzymes of other enzyme families. The statistics of the datasets and the prediction results as well as SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. An independent set of 13 891 enzymes and 122 710 non-enzymes are used to assess the capability of SVM for assignment of enzymes into their respective family (sensitivity) and for assignment of non-member proteins outside that family (specificity). The sensitivity is >85% for 9 families, 70–85% for 21 families, 60–70% for 10 families and 53–60% for 6 families. The specificity is >95% for 38 families and 82–95% for 8 families. The overall sensitivity for all of the 13 891 enzymes is 86%, which is improved against the accuracy of 68% for the assignment of 14 709 enzymes into their respective EC second level class by using one or more of the three statistical learning methods discretized naïve Bayes, C4.5 decision trees, and instance-based leaning (22). SVM has also been used for classification of enzymes into structural families irrespective of sequence similarity, and the accuracy for assignment of 1178 enzymes is 80% (32). These suggest that statistical learning methods are useful for functional and structural family assignment. The overall sensitivity is however slightly lower than that of 92% for the BLAST assignment of the EC class of 12 900 enzymes (38). Non-the-less, as these are to a certain extent independent of sequence similarity, statistical learning methods such as SVM are useful alternative for studying novel proteins whose function cannot be assigned on the basis of sequence similarity. Enzymes without a homolog of known function are searched from the Swiss-Prot database (13) by using the key word ‘novel’, ‘distinct’, or ‘unrelated’ combined with ‘enzyme’. The next step is to eliminate those with at least one homolog of known function (except for hypothetical proteins) by conducting a PSI-BLAST (1) search against the NR databases that include all non-redundant GenBank, CDS translations, PDB, Swiss-Prot, PIR and PRF databases. This ensures that only those truly having no homolog in protein databases are selected. While the selected enzymes from this process are without a homolog, their function has been determined experimentally and these have been reported in the literature and subsequently described in the Swiss-Prot database. The last step is to remove those present in the SVMProt training sets. Table 3 gives the 12 enzymes without a homolog in the NR databases (group NR) and additional 38 enzymes without a homolog in the Swiss-Prot database (group SP) selected from this process, none of which are in the SVM training sets; 8 out of 12 (67%) enzymes in group NR and 28 out of 38 (73.7%) enzymes in group SP are correctly assigned to the respective family by SVMProt. The overall accuracy is 72% which is comparable to the average sensitivity for the enzyme families and it is consistent with the sequence-similarity-independent nature of SVM functional assignment. To further facilitate the testing of SVMProt for functional family assignment of novel proteins, a number of proteins of unknown function are selected. These proteins are either without a homolog or without functional indication in Swiss-Prot or NR database as of September 2004 based on PSI-BLAST search. The predicted functional classes of these proteins are given in the Supplementary Material.

Table 3.

List of enzymes without a homolog in the NR and Swiss-Prot databases and the results of SVM functional family assignment

Enzyme (EC number) [Swiss-Prot accession number]	Database containing no homolog	SVM assigned functional family (probability of correct prediction)	Assignment status
Thiocyanate hydrolase beta subunit (EC 3.5.5.8) [O66186].	NR	EC 3.5 Hydrolase of non-peptide carbon–nitrogen bonds (98.9%)	+
		EC 2.6 Transferases of nitrogenous groups (62.2%)
Potential cysteine protease avirulence protein avrPpiC2 (EC 3.4.22.-) [Q9F3T4].	NR	EC 4.2 Carbon–oxygen lyase (93.6%)	−
		EC 2.3 Acyltransferase (83.9%)
		EC 4.1 Carbon–carbon lyase (71.3%)
		Outer membrane (58.6%)
Extracellular phospholipase (EC 3.1.1.5) [P82476]	NR	EC 3.1 Hydrolase of ester bonds (98.7%)	+
Cytochrome c oxidase polypeptide IV, mitochondrial precursor (EC 1.9.3.1) [P30815].	NR	EC 1.9 Oxidoreductase of a heme group of donors (99.0%)	+
Cytochrome c oxidase polypeptide VI (EC 1.9.3.1) [P26310].	NR	EC 1.9 Oxidoreductase of a heme group of donors (98.4%)	+
		Transmembrane (98.3%)
		EC 3.1 Hydrolase of ester bonds (62.2%)
Alginate lyase precursor (EC4.2.2.3) [P39049].	NR	Transmembrane (65.4%)	−
		Outer membrane (58.6%)
		EC 2.1 Transferase of one-carbon groups (58.6%)
DNA α-glucosyltransferase (EC 2.4.1.26) [P04519]	NR	EC 2.4 Glycosyltransferase (80.4%);	+
		EC 2.7 Transferase of phosphorus-containing groups (68.5%)
Endonuclease CviAII (EC 3.1.21.4 [P31117]	NR	EC 3.1 Hydrolase of ester bonds (99.0%)	+
Type II restriction enzyme CviJI (EC 3.1.21.4) [P52283]	NR	EC 3.1 Hydrolase of ester bonds (99.0%);	+
		rRNA-binding proteins (98.8%);
		EC 3.4 Peptidase (68.5%)
DNA-directed RNA polymerase, subunit 10 homolog (EC 2.7.7.6) [P42488]	NR	EC 2.7 Transferase of phosphorus-containing groups (99.0%)	+
		7 transmembrane receptor metabotropic glutamate family (58.6%)
Endonuclease IV (EC 3.1.21.-) [P39250]	NR	No function predicted	−
Beta-agarase precursor (EC3.2.1.81) [P13734].	NR	EC 4.1 Carbon–carbon lyase (96.7%)	−
		EC 2.4 Glycosyltransferase (71.3%)
Phenylacetaldoxime dehydratase (EC 4.2.1.-) [P82604].	Swiss-Prot	Transmembrane (98.2%)	−
		EC 3.4 Peptidase (96.4%)
		EC 3.3 Hydrolase of ether bonds (80.4%)
		EC 2.7 Transferase of phosphorus-containing groups (73.8%)
ATP synthase H chain, mitochondrial precursor (EC3.6.3.14) [Q12349].	Swiss-Prot	EC 3.6 Hydrolase of acid anhydrides (99.0%)	+
		RNA-binding protein (58.6%)
Peptide-N(4)-(N-acetyl-β-d-glucosaminyl)asparagine amidase F precursor (EC 3.5.1.52) [P21163]	Swiss-Prot	EC 3.5 Hydrolase of non-peptide carbon–nitrogen bonds (99.0%)	+
		Beta-Barrel porin (58.6%)
S-Adenosyl-l-methionine hydrolase (EC 3.3.1.2) [P07693]	Swiss-Prot	EC 3.3 Hydrolase of ether bonds (99.0%)	+
		EC 2.7 Transferase of phosphorus-containing groups (71.3%)
		DNA-binding protein (65.4%)
Hypothetical 52.8 kDa protein in VPS15-YMC2 intergenic region. (EC 3.1.22.-) [P38257]	Swiss-Prot	DNA-binding protein (89.3%)	−
		Outer membrane (58.6%)
Hypothetical protein BBB03 (EC3.1.22.-) [O50979].	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (88.1%)	−
		EC 3.4 Peptidase (86.8%)
		EC 2.3 Acyltransferase (71.3%)
		EC 4.1 Carbon–carbon lyase (65.4%)
Telomere elongation protein (EC2.7.7.-) [P17214].	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (99.1%)	+
		DNA-binding protein (78.4%)
Fucose-1-phosphate guanylyltransferase (EC 2.7.7.30) [O14772]	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (99.1%)	+
		7 transmembrane receptor metabotropic glutamate family (58.6%)
DNA-directed RNA polymerase I 14 kDa polypeptide (EC 2.7.7.6) [P50106].	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (99%)	+
		DNA-binding protein (62.2%)
		Beta-Barrel porin (58.6%)
		EC 3.4 Peptidase (58.6%)
DNA polymerase III, theta subunit (EC 2.7.7.7) [P28689].	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (99.0%)	+
		EC 4.2 Carbon–oxygen lyase (58.6%)
Cytochrome c oxidase polypeptide IV (EC 1.9.3.1) [P77921]	Swiss-Prot	EC 1.9 Oxidoreductase of a heme group of donors (97.0%)	+
		Envelope protein (58.6%)
		Transmembrane (58.6%)
Cytochrome c oxidase polypeptide VII (EC 1.9.3.1) [P10174].	Swiss-Prot	EC 1.9 Oxidoreductase of a heme group of donors (98.3%)	+
		Transmembrane (58.6%)
Cytochrome c oxidase polypeptide VIII, mitochondrial precursor (EC 1.9.3.1) [P04039].	Swiss-Prot	EC 1.9 Oxidoreductase of a heme group of donors (99.0%)	+
		Transmembrane (58.6%)
		RNA-binding protein (58.6%)
Cytochrome c oxidase polypeptide VIIA precursor (EC1.9.3.1) [P07255].	Swiss-Prot	EC 1.9 Oxidoreductase of a heme group of donors (97.8%)	+
		Transmembrane (93.8%)
		EC 1.10 Oxidoreductase of diphenols and related substances as donors (58.6%)
		Alpha-type channel (58.6%)
Heme-copper oxidase subunit IV (EC 1.9.3.-) [Q9YDX4].	Swiss-Prot	EC 1.9 Oxidoreductase of a heme group of donors (99.0%)	+
		Transmembrane (99.0%)
Aminoglycoside 2′-N-acetyltransferase (EC 2.3.1.-) [P95219]	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (78.4%)	−
		EC 4.2 Carbon–oxygen lyase (58.6%)
Glycosyl transferase alg8 (EC2.4.1.-) [Q887P9].	Swiss-Prot	Transmembrane (99.0%)	^*
		EC 2.4 Glycosyltransferase (98.6%)
Beta-agarase B (EC 3.2.1.81) [P48840].	Swiss-Prot	Outer membrane (58.6%)	−
		Beta-Barrel porin (58.6%)
CM (EC 5.4.99.5) [P19080]	Swiss-Prot	EC 5.4. Intramolecular transferase (99.0%)	+
		EC 4.2. Carbon–oxygen lyase (58.6%)
		Outer membrane (58.6%)
DNA β-glucosyltransferase (EC 2.4.1.27) [P04547]	Swiss-Prot	EC 2.4 Glycosyltransferases (95.7%);	+
		EC 2.5 Transferase of alkyl or aryl groups, other than methyl groups (80.4%)
dNMPkinase (EC 2.7.4.13) [P04531]	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (99.0%);	+
		EC 2.4 Glycosyltransferase (96.4%);
		EC 1.1 Oxidoreductase of the CH-OH group of donors (71.3%)
Endonuclease II (EC 3.1.21.1) [P07059]	Swiss-Prot	EC 3.1 Hydrolase of ester bonds (99.0%)	+
Endonuclease V (EC 3.1.25.1) [P04418]	Swiss-Prot	EC 3.1 Hydrolase of ester bonds (99.0%)	+
Exonuclease (EC 3.1.11.3) [P03697]	Swiss-Prot	EC 3.1 Hydrolase of ester bonds (99.0%);	+
		EC 4.1 Carbon–carbon lyases (88.1%);
		EC 2.7 Transferase of phosphorus-containing groups (68.5%);
		EC 1.1 Oxidoreductase of the CH-OH group of donors (58.6%)
Ribonuclease (EC 3.1.-.-)[P13312]	Swiss-Prot	EC 3.1 Hydrolase of ester bonds (99.0%)	+
Intron-associated endonuclease 1 (EC 3.1.-.-) [P13299]	Swiss-Prot	EC 3.1 Hydrolase of ester bonds (99.0%);	+
		DNA-binding protein (83.9%)
Intron-associated endonuclease 2 (EC 3.1.-.-) [P07072]	Swiss-Prot	EC 3.1 Hydrolase of ester bonds (99.0%)	+
Putative adenine-specific methylase (EC 2.1.1.72) [P51715]	Swiss-Prot	EC 2.1 Transferase of one-carbon groups (99.0%);	+
		Outer membrane (58.6%);
		mRNA-binding protein (58.6%)
Protein kinase (EC 2.7.1.37) [P00513]	Swiss-Prot	EC 2.7 Transferase of phosphorus-containing groups (99.0%)	+
Slt35 (EC 3.2.1.-) [P41052]	Swiss-Prot	Outer membrane (99.0%)	−
		EC 1.1. Oxidoreductase acting on the CH-OH group of donors (89.3%)
		EC 4.1. Carbon–carbon lyase (62.2%)
Ammonia monooxygenase (EC 1.13.12.-)[Q04508]	Swiss-Prot	EC 1.13. oxygenase (99.0%)	+
		Transmembrane (99.0%)
		EC 2.4. Glycosyltransferases (83.9%)
2-Aminomuconate deaminase (EC 3.5.99.5) [P81593]	Swiss-Prot	EC 3.5. Hydrolase acting on carbon–nitrogen bonds, other than peptide bonds (99.0%)	+
		EC 3.4. Peptidase (58.6%)
ADP-ribosyltransferase (EC2.4.2.37) [P14299]	Swiss-Prot	Transmembrane (92.9%)	^*
		EC 2.4. Glycosyltransferase (90.3%)
		Outer membrane (58.6%)
Alpha-N-AFase II (EC 3.2.1.55) [P82594]	Swiss-Prot	EC 3.4. Peptidase (91.3%)	−
Aminopeptidase G (EC 3.4.11.-) [Q54340]	Swiss-Prot	EC 3.4. Peptidase (99.0%)	+
		TC 1.C. Channels/pores—pore-forming toxins (proteins and peptides) (58.6%)
Alginate lyase (EC 4.2.2.3) [Q59478]	Swiss-Prot	Transmembrane (96.4%)	−
		EC 3.1. Hydrolase of ester bonds (78.4%)
		Outer membrane (58.6%)
ATPE_YEAST (EC 3.6.3.14) [P21306]	Swiss-Prot	RNA-binding proteins (58.6%)	−
AhdA2cA1c (EC1.14.-.-) [BAC65427.1]	Swiss-Prot	EC 3.1. Hydrolase of ester bonds (82.2%)	−
		DNA-binding protein (80.4%)
		Transmembrane (58.6%)

The symbol +, * and − represent the cases that the top of the predicted family, one of the predicted families, and none of the predicted families matches the enzyme function, respectively.

There are eight pairs of homologous enzymes of different families from previous publications (8,27) that satisfy the stricter criterion, which together with SVMProt predicted top family for each enzyme are given in Table 4. It is found that 5 or 62% of these enzyme pairs are correctly assigned by SVMProt, such an accuracy is comparable to the average sensitivity for the enzyme families and indicative of the sequence-similarity-independent nature of SVM functional assignment.

Table 4.

List of pairs of homologous enzymes of different families and the results of SVM functional family assignment

Enzyme E1 (Swiss-Prot accession number)	EC class (F1)	Enzyme E2 (Swiss-Prot accession number)	EC class (F2)	Sequence similarity (BLAST E-value)	SVM functional family assignment	Assignment status
Glycolateoxidase (P05414)	EC 1.1	IPP isomerase (Q8PW37)	EC 5.3	3.00E−07	E1->F1;E2->F2	+
Creatine amidinohydrolase (P38488)	EC 3.5	Prolinedipeptidase (O58885)	EC 3.4	3.00E−15	E1->F1;E2->F2	+
Cystathionine gamma-synthase (P38675)	EC 2.5	Methionine gamma-lyase (P13254)	EC 4.4	2.00E−15	E1->W;E2->F2	−
Exocellobiohydrolase 1 (P38676)	EC 3.2	Cystathionine gamma-lyase (Q8VCN5)	EC 4.4	1.00E−12	E1->W;E2->F2	−
Maleylacetoacetate isomerase (P57109)	EC 5.2	Glutathione S-transferase zeta class (P57108)	EC 2.5	1.00E−51	E1->F1;E2->F2	+
Tyrosine-protein kinase FRK (P42685)	EC 2.7	Intestinalguanylate cyclase (P70106)	EC 4.6	2.60E−12	E1->F1;E2->F1	−
Glutamate-1-semialdehyde aminotransferase (Q06774)	EC 5.4	4-aminobutyrate aminotransferase (P22256)	EC 2.6	5.70E−32	E1->F1;E2->F2	+
Exodeoxyribonuclease (P37454)	EC 3.1	DNA- (apurinic or apyrimidinic site) lyase (P43138)	EC 4.2	1.60E−96	E1->F1;E2->F2	+

E1-> F1 or E2 -> F2 indicates that enzyme E1 or E2 is assigned into family F1 and F2 respectively. E1-> W or E2 -> W indicates that enzyme E1 or E2 is assigned into a wrong family respectively. The symbol + or − represents the cases that SVM is able or unable to distinguish the two enzymes and exclusively assign them into the respective family.

These results suggest that SVM has some capability for functional family assignment of novel proteins having no homolog, and for distinguishing homologous proteins of different functions. The overall accuracy of SVM is not yet at the same level of that of sequence alignment for homologous proteins. One reason is the imbalance between the number of positive and negative samples. The total number of distinct enzymes in some families is <200, which is significantly smaller than that of a few thousand representative proteins used as the negative samples of the respective family. Such a large data imbalance is known to affect the accuracy of a SVM classification system, and methods for solving these problems are being developed (39). It is likely that not all possible types of proteins, particularly those of distantly related members, are adequately represented in some families. This can be improved along with the availability of more protein data. Not all distantly related proteins of the same function have similar structural and chemical features due to the flexibility at the active site (17). This plasticity needs to be properly formulated. These improvements will enable the development of SVM into a useful tool for facilitating functional study of novel proteins.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

36 in total

1. Evolution of function in protein superfamilies, from a structural perspective.

Authors: A E Todd; C A Orengo; J M Thornton
Journal: J Mol Biol Date: 2001-04-06 Impact factor: 5.469

2. Multi-class protein fold recognition using support vector machines and neural networks.

Authors: C H Ding; I Dubchak
Journal: Bioinformatics Date: 2001-04 Impact factor: 6.937

3. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach.

Authors: S Hua; Z Sun
Journal: J Mol Biol Date: 2001-04-27 Impact factor: 5.469

Review 4. Protein function in the post-genomic era.

Authors: D Eisenberg; E M Marcotte; I Xenarios; T O Yeates
Journal: Nature Date: 2000-06-15 Impact factor: 49.962

5. GeneRAGE: a robust algorithm for sequence clustering and domain detection.

Authors: A J Enright; C A Ouzounis
Journal: Bioinformatics Date: 2000-05 Impact factor: 6.937

6. Classifying G-protein coupled receptors with support vector machines.

Authors: Rachel Karchin; Kevin Karplus; David Haussler
Journal: Bioinformatics Date: 2002-01 Impact factor: 6.937

Review 7. Determination of protein function, evolution and interactions by structural genomics.

Authors: S A Teichmann; A G Murzin; C Chothia
Journal: Curr Opin Struct Biol Date: 2001-06 Impact factor: 6.809

8. Predicting protein--protein interactions from primary structure.

Authors: J R Bock; D A Gough
Journal: Bioinformatics Date: 2001-05 Impact factor: 6.937

9. Enzyme function less conserved than anticipated.

Authors: Burkhard Rost
Journal: J Mol Biol Date: 2002-04-26 Impact factor: 5.469

10. Complete nucleotide sequence and genome organization of Grapevine fleck virus.

Authors: Sead Sabanadzovic; Nina Abou Ghanem-Sabanadzovic; Pasquale Saldarelli; Giovanni P Martelli
Journal: J Gen Virol Date: 2001-08 Impact factor: 3.891

20 in total

1. A top-down approach to classify enzyme functional classes and sub-classes using random forest.

Authors: Chetan Kumar; Alok Choudhary
Journal: EURASIP J Bioinform Syst Biol Date: 2012-02-29

2. Computational Approaches for Automated Classification of Enzyme Sequences.

Authors: Akram Mohammed; Chittibabu Guda
Journal: J Proteomics Bioinform Date: 2011-08-23

3. Identification of protein functions using a machine-learning approach based on sequence-derived properties.

Authors: Bum Ju Lee; Moon Sun Shin; Young Joon Oh; Hae Seok Oh; Keun Ho Ryu
Journal: Proteome Sci Date: 2009-08-09 Impact factor: 2.480

4. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning.

Authors: Jiajun Hong; Yongchao Luo; Yang Zhang; Junbiao Ying; Weiwei Xue; Tian Xie; Lin Tao; Feng Zhu
Journal: Brief Bioinform Date: 2020-07-15 Impact factor: 11.622

5. Analysis of protein determinants of host-specific infection properties of polyomaviruses using machine learning.

Authors: Myeongji Cho; Hayeon Kim; Hyeon S Son
Journal: Genes Genomics Date: 2021-03-01 Impact factor: 1.839

6. Classification of lung cancer tumors based on structural and physicochemical properties of proteins by bioinformatics models.

Authors: Faezeh Hosseinzadeh; Mansour Ebrahimi; Bahram Goliaei; Narges Shamabadi
Journal: PLoS One Date: 2012-07-19 Impact factor: 3.240

7. A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data.

Authors: Hua Yu; Jianxin Chen; Xue Xu; Yan Li; Huihui Zhao; Yupeng Fang; Xiuxiu Li; Wei Zhou; Wei Wang; Yonghua Wang
Journal: PLoS One Date: 2012-05-30 Impact factor: 3.240

8. Enzyme classification with peptide programs: a comparative study.

Authors: Daniel Faria; António E N Ferreira; André O Falcão
Journal: BMC Bioinformatics Date: 2009-07-24 Impact factor: 3.169

9. Proteome-wide prediction of novel DNA/RNA-binding proteins using amino acid composition and periodicity in the hyperthermophilic archaeon Pyrococcus furiosus.

Authors: Kosuke Fujishima; Mizuki Komasa; Sayaka Kitamura; Haruo Suzuki; Masaru Tomita; Akio Kanai
Journal: DNA Res Date: 2007-06-15 Impact factor: 4.458

10. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems.

Authors: Ram Samudrala; Fred Heffron; Jason E McDermott
Journal: PLoS Pathog Date: 2009-04-24 Impact factor: 6.823