
Prediction of protein essentiality by the support vector machine with statistical tests.

Chiou-Yi Hor, Chang-Biau Yang, Zih-Jie Yang, Chiou-Ting Tseng.

Abstract

Essential proteins comprise the minimum set of proteins required to support cell life. Identifying essential proteins is important for understanding the cellular processes of an organism. However, identifying essential proteins experimentally is extremely time-consuming and labor-intensive, so alternative methods must be developed. This study had two goals: identifying the important features and building learning machines for discriminating essential proteins. Data for Saccharomyces cerevisiae and Escherichia coli were used. We first collected information from a variety of sources. We then proposed a modified backward feature selection method and built support vector machine (SVM) predictors based on the selected features. To evaluate the performance, we conducted cross-validations for the originally imbalanced data set and the down-sampled balanced data set. Statistical tests were applied to the performance associated with the obtained feature subsets to confirm their significance. In the first data set, our best values of F-measure and Matthews correlation coefficient (MCC) were 0.549 and 0.495, respectively, in the imbalanced experiments. For the balanced experiment, the best values of F-measure and MCC were 0.770 and 0.545, respectively. In the second data set, our best values of F-measure and MCC were 0.421 and 0.407, respectively, in the imbalanced experiments. For the balanced experiment, the best values of F-measure and MCC were 0.718 and 0.448, respectively. The experimental results show that our selected features are compact and that performance improved. Prediction can also be conducted by users at the following internet address: http://bio2.cse.nsysu.edu.tw/esspredict.aspx.


Keywords:  essential protein; feature selection; protein-protein interaction; statistical test; support vector machine

Year:  2013        PMID: 24250217      PMCID: PMC3795531          DOI: 10.4137/EBO.S11975

Source DB:  PubMed          Journal:  Evol Bioinform Online        ISSN: 1176-9343            Impact factor:   1.625


Introduction

Identifying essential proteins is important for understanding the cellular processes in an organism because no other proteins can perform the functions of essential proteins. Once an essential protein is removed, dysfunction or cell death results. Thus, several studies have been conducted to identify essential proteins. Experimental approaches for identifying essential proteins include gene deletion,1 RNA interference,2 and conditional knockouts.3 However, these methods are labor-intensive and time-consuming. Hence, alternative methods for identifying essential proteins are necessary. The essential protein classification problem involves determining the necessity of a protein for sustaining cellular function or life. Among the methods available for identifying essential proteins, machine-learning based methods are promising approaches. Therefore, several studies have been conducted to examine the effectiveness of this technique. Chin4 proposed a double-screening scheme and constructed a framework known as the hub analyzer (http://hub.iis.sinica.edu.tw/Hubba/index.php) to rank the proteins. Acencio and Lemke5 used Waikato Environment for Knowledge Analysis (WEKA)6 to predict the essential proteins. Hwang et al7 applied a support vector machine (SVM) to classify the proteins. Protein-protein interactions (PPIs) are well-known to be significant characteristics of protein function. Several studies have attempted to predict and classify protein function8 as well as analyze protein phenotype9 by studying interactions. A previous study10 further suggested that essential proteins and nonessential proteins can be discriminated by means of topological properties derived from the PPI network. In spite of the above superior properties, however, analyzing PPI experimentally is time-consuming. With the advent of yeast two-hybrid11 high-throughput techniques, which can be used to identify several PPIs in one experiment, obtaining PPI information has become easier. 
Since a PPI network is similar to a social network in many aspects, some researchers apply social network techniques to analyze PPI networks. Thus, several topological properties have been extensively explored and studied in recent years. Fundamental properties, such as sequence or protein physicochemical properties, were not examined in detail in previous studies. This may be because each of these preliminary properties alone is somewhat less relevant to essentiality. However, this information is highly accessible because only sequence information is required to derive these properties. Hence, we included these properties in our study. For topological properties, in addition to physical interactions, we incorporated a variety of interaction information, including metabolic, transcriptional regulation, integrated functional, and genomic context interactions. Our experimental results revealed that these features provide either complementary information for essentiality identification or other biological justification. To identify the reduced feature subset, which is crucial for biological processes, previous studies have used feature selection techniques. The advantages of feature selection include storage reduction, performance improvement, and data interpretation.12 Depending on whether the feature selection procedure is bound to the predictor, the methods are roughly classified into three categories: filter, wrapper, and embedded. Filter methods often provide a complete ordering of the available features in terms of relevance measures. Methods such as Fisher score,12 mutual information, minimal redundancy and maximal relevance (mRMR),13 conditional mutual information maximization (CMIM),14 and minimal relevant redundancy (mRR)15 belong to this category. Both wrapper and embedded methods involve the selection process as a part of the learning algorithm. 
The former utilizes a learning machine to evaluate subsets of features according to some performance measurements. For example, sequential backward and forward feature selection12 falls into this category. Embedded methods directly perform feature selection in the learning process and are usually specific to given learning machines. Examples include C4.5,16 Classification and Regression Trees (CART),17 and ID3.18 Additionally, some researchers proposed an information-gain-based feature selection method,19 which examines the effectiveness of classifier combination. In this paper, we used two data sets. The first one was from Saccharomyces cerevisiae. The corresponding PPI data set was Scere 20070107, which was obtained from the DIP database. The data set contains a total of 4873 proteins and 17,166 interactions. Our feature set consisted of the features obtained or extracted from the methods proposed by Acencio and Lemke,5 Chin,4 Hwang et al,7 and Lin et al.20 The second data set was from Escherichia coli and was first compiled by Gustafson et al.21 This data set contains a total of 3569 proteins. The associated network information included physical, integrated functional, and genomic context interactions and was collected from Hu et al.22 For both data sets, we proposed a modified sequential backward feature selection method for selecting important features. Next, SVM models were built using the selected feature subsets. In this study, the SVM software LIBSVM23 was adopted for classification models. Each model was applied to both imbalanced and balanced data sets. The results were compared with those of previous studies, and statistical tests were conducted to examine significance. For the imbalanced S. cerevisiae data, our best results for F-measure and MCC were 0.549 and 0.495, respectively, which outperform the best previous method7 with results of 0.354 and 0.36, respectively. 
We obtained values of 0.770 and 0.545 for F-measure and MCC in the balanced data experiment, which was superior to the best previous method7 with 0.737 and 0.492, respectively. For experiments examining the E. coli data set, our best values for F-measure and MCC were 0.421 and 0.407, respectively, in the imbalanced data set. In the balanced experiment, the best values for F-measure and MCC were 0.718 and 0.448, respectively. The results are similar to those of Gustafson et al,21 who examined 29 features, but in our method, only five or seven features were used for prediction. To verify whether our improvement was statistically significant, we performed bootstrap cross-validation24 on performance measures.

Background

The data set

In this paper, we used two data sets for experiments: S. cerevisiae and E. coli. The former included PPI network data. We downloaded the data set from the DIP (http://dip.doe-mbi.ucla.edu/) website.25 The original data set contained 4873 proteins and 17,166 interactions. To comply with previous studies, we also adopted the largest connected component of the network data. There were a total of 4815 proteins, including 975 essential proteins and 3840 nonessential proteins. The information on protein essentiality was obtained from the Saccharomyces Genome Database (SGD), which is located at http://www.yeastgenome.org/. Since this data set has been used in several previous studies, we obtained and incorporated various related features for our experiments. The E. coli data set was obtained from Gustafson et al.21 It contained 3569 proteins, among which 611 are essential. Due to availability and coverage issues, we used additional information from three networks: the physical interaction (PI) network, the integrated functional interaction network, and the integrated PI and genomic context (GC) network. The information was collected from Hu et al.22 In the above two data sets, the ratio of nonessential proteins to essential proteins was approximately 4:1 and 5:1, respectively. The data imbalance will inevitably lead to biased fitting to nonessential proteins during the learning processes. Thus, we constructed another, balanced data set. Taking the first data set as an example, we randomly selected 975 nonessential proteins and mixed them with the 975 essential proteins to form a balanced data set. In the new data set, the number of nonessential data elements equaled the number of essential data elements.
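The down-sampling step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `downsample_balanced`, the placeholder protein identifiers, and the fixed random seed are assumptions introduced here; only the class sizes (975 essential, 3840 nonessential) come from the text.

```python
import random


def downsample_balanced(essential, nonessential, seed=0):
    """Keep every essential protein and draw an equally sized random
    sample (without replacement) from the nonessential proteins."""
    if len(nonessential) < len(essential):
        raise ValueError("expected nonessential proteins to be the majority class")
    rng = random.Random(seed)
    sampled = rng.sample(list(nonessential), len(essential))
    # Label 1 = essential (positive), label 0 = nonessential (negative).
    balanced = [(p, 1) for p in essential] + [(p, 0) for p in sampled]
    rng.shuffle(balanced)
    return balanced


# Placeholder identifiers standing in for the S. cerevisiae proteins.
essential = [f"E{i}" for i in range(975)]
nonessential = [f"N{i}" for i in range(3840)]
balanced = downsample_balanced(essential, nonessential)
```

Fixing the seed makes the balanced subset reproducible. Because a single draw discards most nonessential proteins, one would presumably repeat the draw with different seeds to average out sampling noise.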

Bootstrap cross-validation

We used bootstrap cross-validation (BCV) to compare the performance of two classifiers using k-fold cross-validation. Assume that a sample $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ is composed of n observations, where $x_i$ represents the feature vector of the ith observation and $y_i$ denotes the class label associated with $x_i$. A bootstrap sample $S^*_b = \{(x^*_1, y^*_1), (x^*_2, y^*_2), \ldots, (x^*_n, y^*_n)\}$ consists of n observations that are sampled from S with replacement, where $1 \le b \le B$ and B is a constant between 50 and 200. For each sample $S^*_b$, a k-fold cross-validation was carried out. The performance measure $c_b$, such as error rate, was calculated with $S^*_b$. The procedure was repeated B times and then the average performance measure $C = \frac{1}{B}\sum_{b=1}^{B} c_b$ was evaluated over the B bootstrap samples. Since the distribution of the bootstrap performance measures was approximately normal, the confidence interval and significance were estimated accordingly.
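The bootstrap cross-validation loop can be sketched as below, under stated assumptions. The one-dimensional threshold classifier, the toy data, and all function names are hypothetical stand-ins for the SVM and the protein features. The interval is a plain normal approximation (mean ± 1.96 standard deviations of the B bootstrap scores), which is one reading of "estimated accordingly" rather than a confirmed detail of the study.

```python
import random
import statistics


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(sample, k, fit, predict, rng):
    """Error rate of one k-fold cross-validation on `sample`."""
    errors = 0
    for held_out in k_fold_indices(len(sample), k, rng):
        held = set(held_out)
        train = [obs for i, obs in enumerate(sample) if i not in held]
        model = fit(train)
        errors += sum(predict(model, sample[i][0]) != sample[i][1] for i in held_out)
    return errors / len(sample)


def bootstrap_cv(sample, k, B, fit, predict, seed=0):
    """Average a k-fold CV error over B bootstrap samples drawn from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    scores = []
    for _ in range(B):
        boot = [sample[rng.randrange(n)] for _ in range(n)]  # with replacement
        scores.append(cross_val_error(boot, k, fit, predict, rng))
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    # Normal-approximation 95% interval (an assumption of this sketch).
    return mean, (mean - 1.96 * sd, mean + 1.96 * sd), scores


def fit_threshold(train):
    """Hypothetical toy classifier: threshold halfway between the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        return 0.0  # degenerate training fold; arbitrary fallback
    return (statistics.fmean(pos) + statistics.fmean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Toy data: two slightly overlapping classes on a line.
toy = [(0.1 * i, 0) for i in range(20)] + [(1.5 + 0.1 * i, 1) for i in range(20)]
mean_err, interval, scores = bootstrap_cv(
    toy, k=5, B=50, fit=fit_threshold, predict=predict_threshold
)
```

Passing `fit` and `predict` as arguments keeps the resampling logic independent of the learner, so the same loop could in principle wrap an SVM.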

Performance measures

In this study, the performance measures included precision, recall, F-measure (F1), Matthews correlation coefficient (MCC), and top percentage of essential proteins. Their formulas are given as follows:
Precision: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$
Recall: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
F-measure: $2 \times \mathrm{precision} \times \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$
MCC: $(\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}) / \sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}$
Top percentage of essential proteins: $\mathrm{TP} / n$
Here, an essential protein is represented by the positive observation. True positive (TP), true negative (TN), false positive (FP), and false negative (FN) represent the numbers of true positive, true negative, false positive, and false negative proteins, respectively. The value n denotes the total number of predictions. In addition, the receiver operating characteristic (ROC) curve18 and the area under the curve (AUC) were used to evaluate the classification performance.
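The four count-based measures can be computed directly from a confusion matrix, as in the short sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch, and the example counts are invented for illustration.

```python
import math


def classification_measures(tp, tn, fp, fn):
    """Precision, recall, F-measure and MCC from confusion-matrix counts.
    Undefined ratios (zero denominators) are reported as 0.0 by convention."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Invented counts for an imbalanced problem (100 positives, 900 negatives).
measures = classification_measures(tp=50, tn=850, fp=50, fn=50)
```

With these counts, precision, recall and F-measure are all 0.5, while MCC is about 0.444 because it also credits the large number of correctly rejected negatives.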

Feature extraction

The feature set we used included sequence properties (S), such as amino acid occurrence and average amino acid PSSM; protein properties (P), such as cell cycle and metabolic process; topological properties (T), such as the bit string of the double screening scheme and betweenness centrality related to physical interactions; and other properties (O), such as phyletic retention and essential index. There were a total of 45 groups and 90 features in the S. cerevisiae data set. For the E. coli data set, there were 35 groups and 80 features. All names and sources are shown in Table 1. Only the bit string of the double screening scheme is presented in this section.
Table 1

Protein features.

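ID | Property name | Ref | Type | Size | Sub-names | S. cere | E. coli
1 | Amino acid occurrence | 20 | S | 20 | A … Y | |
2 | Average amino acid PSSM | 20 | S | 20 | A … Y | |
3 | Average cysteine position | 20 | S | 1 | | |
4 | Average distance of every two cysteines | 20 | S | 1 | | |
5 | Average hydrophobic | 20 | S | 1 | | |
6 | Average hydrophobicity around cysteine | 20 | S | 4 | 1 … 4 | |
7 | Cysteine count | 20 | S | 1 | | |
8 | Cysteine location | 20 | S | 5 | 1 … 5 | |
9 | Cysteine odd-even index | 20 | S | 1 | | |
10 | Protein length | 20 | S | 1 | | |
11 | Cell cycle | 5 | P | 1 | | |
12 | Cytoplasm | 5 | P | 1 | | |
13 | Endoplasmic reticulum | 5 | P | 1 | | |
14 | Metabolic process | 5 | P | 1 | | |
15 | Mitochondrion | 5 | P | 1 | | |
16 | Nucleus | 5 | P | 1 | | |
17 | Other process | 5 | P | 1 | | |
18 | Other localization | 5 | P | 1 | | |
19 | Signal transduction | 5 | P | 1 | | |
20 | Transport | 5 | P | 1 | | |
21 | Transcription | 5 | P | 1 | | |
22 | Betweenness centrality related to all interactions | 41 | T | 1 | | |
23 | Betweenness centrality related to metabolic interactions | 5 | T | 1 | | |
24 | Betweenness centrality related to physical interactions | 5 | T | 1 | | |
25 | Betweenness centrality transcriptional regulation interactions | 5 | T | 1 | | |
26 | Bit string of double screening scheme | [this paper] | T | 1 | | |
27 | Bottleneck | 8,41 | T | 1 | | |
28 | Clique level | 7 | T | 1 | | |
29 | Closeness centrality | 42 | T | 1 | | |
30 | Clustering coefficient | 7 | T | 1 | | |
31 | Degree related to all interactions | 43 | T | 1 | | |
32 | Degree related to physical interactions | 5 | T | 1 | | |
33 | Density of maximum neighborhood component | 4 | T | 1 | | |
34 | Edge percolated component | 9 | T | 1 | | |
35 | Indegree related to metabolic interaction | 5 | T | 1 | | |
36 | Indegree related to transcriptional regulation | 5 | T | 1 | | |
37 | Maximum neighborhood component | 4 | T | 1 | | |
38 | Neighbors’ intra-degree | 7 | T | 1 | | |
39 | Outdegree related to metabolic interaction | 5 | T | 1 | | |
40 | Outdegree related to transcriptional regulation interaction | 5 | T | 1 | | |
41 | Betweenness centrality related to integrated functional interaction | 22 | T | 1 | | |
42 | Betweenness centrality related to integrated PI and GC network | 22 | T | 1 | | |
43 | Degree related to integrated functional interaction | 22 | T | 1 | | |
44 | Degree related to integrated PI and GC network | 22 | T | 1 | | |
45 | Common function degree | 7 | O | 1 | | |
46 | Essential index | 7 | O | 1 | | |
47 | Identicalness | 5 | O | 1 | | |
48 | Open reading frame length | 7 | O | 1 | | |
49 | Phyletic retention | 21 | O | 1 | | |
50 | Number of paralogous genes | 21 | O | 1 | | |
51 | Codon Adaptation Index (CAI) | 21,44 | O | 1 | | |
52 | Codon Bias Index (CBI) | 21,44 | O | 1 | | |
53 | Frequency of optimal codons | 21,44 | O | 1 | | |
54 | Aromaticity score | 21,44 | O | 1 | | |
55 | Leading strand of the circular chromosome | 21 | O | 1 | | |
Total | | | | 100 | | 90 | 80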

Notes: S. cere and E. coli denote the Saccharomyces cerevisiae and Escherichia coli data sets, respectively. Unless otherwise noted, topological features are related to physical interactions. Owing to coverage or availability issues, we adopted different features for the S. cere and E. coli data sets. For example, interactions in the E. coli data set contain integrated functional, PI, and GC network information, while those in the S. cere data set include metabolic, transcriptional regulation, and PI network information.

Abbreviations: GC, genomic context; PI, physical interactions.

The remaining features are detailed in the Appendix. Lin et al26 and Chin4 proposed the double screening scheme. They used multiple ranking scores to sort essential proteins. The drawback is that each protein does not have a unique score. Thus, we propose a bit string implementation that combines the two rankings into a single score. An example of our bit string implementation is shown in Tables 2 and 3. Suppose that four proteins, W, X, Y, and Z, are to be ranked. In the first iteration, we want to find the top-ranked protein. We first select the top 2 proteins using ranking method A, which are W and X. Next, we use method B to rank these two proteins. The relative ranks of W and X are 2 and 1, respectively. Hence, in the first iteration, X is selected. It follows that the bit M[X, 1] is set to 1, and the others, M[W, 1], M[Y, 1], and M[Z, 1], are set to 0. In the second iteration, the 2 top-ranking proteins are to be found. First, the four proteins W, X, Y, and Z are selected, because they are the top 4 proteins by ranking method A. Next, with ranking method B, we select the top 2 proteins from them, which are X and Y. Hence, the bits M[X, 2] and M[Y, 2] are set to 1, and the others are set to 0 in this iteration. Finally, we sum up the bits of each protein, as shown in the fourth column of Table 3.
Table 2

Ranking by two different methods, where smaller numbers indicate higher ranks.

Protein name | Rank by method A (DMNC) | Rank by method B (MNC)
W | 1 | 4
X | 2 | 2
Y | 3 | 1
Z | 4 | 3
Table 3

Bit strings by the double screening method.

Protein name | 1st iteration | 2nd iteration | Sum of bit string | n − r | Sum
W | 0 | 0 | 0 | 0 | 0
X | 1 | 1 | 2 | 2 | 4
Y | 0 | 1 | 1 | 3 | 4
Z | 0 | 0 | 0 | 1 | 1
There is still an issue in the bit string implementation: M may be too sparse to be handled by classifiers. Since the number of proteins being selected is around n/2, the sum of about n/2 bits is close to 0. In our experience, this makes it difficult to distinguish between proteins. To overcome this problem, for each protein, we added another score, n − r, to the sum of the bit string, where r is the rank of the protein under ranking method B. In this study, we used DMNC as ranking method A and MNC as ranking method B. In this example, n = 4, so the values of n − r for W, X, Y, and Z are 0, 2, 3, and 1, respectively. We added these values to the bit string sums; hence, the final scores are 0, 4, 4, and 1. The overall procedure is given in the Procedure bit string implementation of DSS.
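The worked example in Tables 2 and 3 can be reproduced with the short sketch below. It assumes, based only on that example, that iteration i screens the top 2i proteins by method A and keeps the top i of them by method B, and that n/2 iterations are run. The function name `dss_scores` and the dictionary-based interface are illustrative choices, and the referenced Procedure may differ in detail.

```python
def dss_scores(rank_a, rank_b):
    """Bit-string double screening score.

    rank_a, rank_b: dicts mapping protein -> rank (1 = best) under
    ranking methods A and B. Returns (bit_sum, final_score) dicts.
    """
    proteins = list(rank_a)
    n = len(proteins)
    by_a = sorted(proteins, key=lambda p: rank_a[p])
    bit_sum = {p: 0 for p in proteins}
    # Assumed schedule: iterations i = 1 .. n/2.
    for i in range(1, n // 2 + 1):
        screened = by_a[: min(2 * i, n)]                         # top 2i by A
        chosen = sorted(screened, key=lambda p: rank_b[p])[:i]   # top i by B
        for p in chosen:
            bit_sum[p] += 1                                      # M[p, i] = 1
    # Add n - r (r = rank under method B) to separate sparse bit sums.
    final = {p: bit_sum[p] + (n - rank_b[p]) for p in proteins}
    return bit_sum, final


# Ranks from Table 2: A = DMNC, B = MNC.
rank_a = {"W": 1, "X": 2, "Y": 3, "Z": 4}
rank_b = {"W": 4, "X": 2, "Y": 1, "Z": 3}
bit_sum, final = dss_scores(rank_a, rank_b)
```

For these ranks the bit sums are 0, 2, 1, 0 and the final scores are 0, 4, 4, 1 for W, X, Y, Z, matching Table 3.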

Sequential backward feature selection method

SVM is a well-established tool for data analysis that has been shown to be useful in various fields, such as text summarization,27 intrusion detection,28 and image coding.29 In this study, we utilized the SVM software developed by Chang and Lin, called LIBSVM.23 To address the data imbalance, we propose the modified sequential backward feature selection method. Since most data were nonessential, choosing only accuracy as an objective or adopting conventional feature ranking schemes favored negative data. As more and more features were excluded, overall accuracy declined. Since the number of negative data elements was higher than that of positive data elements, the true-positive rate decreased more than the true-negative rate. Thus, features should be selected such that most positive samples are correctly classified without deteriorating the overall accuracy too much. In this sense, rather than using only accuracy to guide the feature selection, we used a composite score C as the objective function. The composite score was expressed in terms of precision (P), recall (R), F-measure (F), and MCC (M) and was given as $C = w_P P + w_R R + w_F F + w_M M$. The four adjustable weights, $w_P$, $w_R$, $w_F$, and $w_M$, allow a compromise among the associated performance measures. An additional penalty was imposed on C so that scores associated with fewer features could compete with those associated with more features. The penalized score is written C(S), where S denotes the selected feature subset, and |S| and t denote the size of S and the goal number of features specified by the user, respectively. The penalty is switched on by the unit step function u(|S| − t), which equals 0 if |S| − t ≤ 0 and 1 otherwise. Finally, a threshold ρ was adopted to ensure that an improvement resulting from a feature change was not a random effect. The value of ρ was estimated by comparing the average score difference among feature subsets of sizes p and p + 1 in a preliminary run for several different values of p. The value e denotes the penalty score when an additional feature is selected. The score is also specified by the user and should be slightly larger than ρ to encourage feature subsets of smaller sizes. The procedure for feature selection is described in Procedure backward feature selection.
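To make the role of the weights, the penalty e, the goal size t, and the threshold ρ concrete, here is a simplified greedy sketch. Several parts are assumptions rather than the published procedure: the penalty is taken to be e times the number of features beyond t, gated by the unit step u(|S| − t); removal stops when the best single-feature removal fails to improve the score by more than ρ; and the procedure's parameter r is ignored. The `toy_evaluate` function is a hypothetical stand-in for cross-validated SVM performance.

```python
def composite_score(p, r, f, m, n_features, weights=(1.0, 1.0, 1.0, 1.0), e=0.0, t=0):
    """Weighted sum of precision, recall, F-measure and MCC, minus an
    assumed penalty of e per feature beyond the goal size t."""
    w_p, w_r, w_f, w_m = weights
    score = w_p * p + w_r * r + w_f * f + w_m * m
    excess = n_features - t
    if excess > 0:  # unit step u(|S| - t) = 1
        score -= e * excess
    return score


def backward_selection(features, evaluate, rho=0.005, **score_args):
    """Greedy backward elimination: repeatedly drop the feature whose
    removal raises the composite score most, while the gain exceeds rho."""
    current = list(features)
    best = composite_score(*evaluate(current), n_features=len(current), **score_args)
    while len(current) > 1:
        trials = []
        for drop in current:
            subset = [f for f in current if f != drop]
            score = composite_score(*evaluate(subset), n_features=len(subset), **score_args)
            trials.append((score, drop))
        trial_score, drop = max(trials)
        if trial_score - best <= rho:  # gain too small to be trusted
            break
        current.remove(drop)
        best = trial_score
    return current, best


INFORMATIVE = {"a", "b"}  # hypothetical ground truth for the toy evaluator


def toy_evaluate(subset):
    """Stand-in for cross-validation: informative features help, noise hurts."""
    hits = sum(1 for f in subset if f in INFORMATIVE)
    noise = len(subset) - hits
    value = 0.2 + 0.3 * hits - 0.02 * noise
    return value, value, value, value  # (P, R, F, M)


selected, score = backward_selection(["a", "b", "n1", "n2", "n3"], toy_evaluate)
```

On the toy evaluator the loop discards the three noise features and keeps `a` and `b`. With e slightly larger than ρ and a positive t, each removal above the goal size earns back e, which is what lets smaller subsets compete even when performance drops slightly.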

Experimental procedure and results

For comparison purposes, we used two feature selection methods: mRMR and CMIM. In the S. cerevisiae data set, our results were also compared to those of Acencio and Lemke5 and Hwang et al.7 For the E. coli data set, we also compared our results with those of Gustafson et al.21

Experimental procedure

The overall procedure of our experiments is illustrated in Figure 1 and is described as follows.
Figure 1

Flowchart for the construction of SVM models and performance comparison.

Stage 1: Determine benchmark feature set

For the S. cerevisiae data set, we used Hwang’s feature set as the benchmark. For the E. coli data set, we used Gustafson’s feature set as the benchmark. These two feature sets are considerably effective for various performance measures.

Stage 2: Tune SVM parameters for best performance

For the above two feature sets, we first ran the SVM software using the feature sets of Hwang or Gustafson and tuned the SVM parameters to achieve the highest average performances.

Stage 3: Adopt best performances as reference performances

After determining the best SVM parameters for the feature sets of Hwang and Gustafson, we recorded the SVM parameters and results. To compare our results with other models, such as those obtained using our methods, mRMR, and CMIM, we used the same SVM software and adjusted the cost parameters of the SVM in order to achieve similar levels of precision.

Stage 4: Perform feature selection

We randomly chose 50% of available data. Next, the backward feature selection procedure was applied to these selected data. In the beginning of our feature selection procedure, we imposed no penalty on the score calculation. Hence, the procedure attempts to achieve the highest score. In the subsequent runs, we added penalties for feature sizes to the score calculation. Subsets with smaller feature size but only slightly inferior in performance were selected. To compare our results with those of other methods, we also used the mRMR and CMIM feature ranking methods and chose subsets as in Procedure backward feature selection.

Stage 5: Perform 10-fold and bootstrap cross-validations

The data were prepared in both balanced and imbalanced manners. For each data set, we randomly partitioned all data into 10 disjoint groups and used the feature subsets selected in the previous stage to calculate various performance measures. The 10-fold cross-validation was repeated 10 times and average performance measures were computed. Next, a bootstrap sampling procedure was conducted and 200 bootstrap samples were produced, including both balanced and imbalanced samples. Each bootstrap sample was also partitioned for 10-fold cross-validation and performance measures were calculated. Note that all models were examined with the same sets of data partitions for conventional and bootstrap cross-validations.

Stage 6: Perform significance tests

Once bootstrap cross validations were carried out, the significance tests were adopted accordingly. In addition to the average values of AUC, precision, recall, F-measure, and MCC, we conducted a statistical significance test for these performance measures. Additionally, we calculated ROC curves and top percentage values for imbalanced experiments.

Backward feature selection and mRMR/CMIM feature ranking

We used 50% of the available data elements for feature selection. Taking the S. cerevisiae data set as an example, only (3840 + 975) × 50% observations were randomly chosen for the backward feature selection procedure. During the procedure, several performance scores were calculated by means of k-fold cross-validations. In the first run, the parameters were set as follows: k = 2, all four weights of the composite score equal to 1, ρ = 0.005, e = 0, t = 0, and r = 5. Since all associated weights were equal, the procedure sought the best compromise among all performance measures. Setting t = e = 0 meant that no goal number of selected features was imposed, giving the procedure the opportunity to explore all available feature combinations to achieve the best performance. In this initial run, we obtained a feature subset of size 18. For the subsequent runs, the value of t was decreased starting from 17 (= 18 − 1) until performances were significantly worse than those of Hwang et al.7 To obtain feature subsets of reduced sizes, the parameters were set as follows: k = 2, one weight equal to 1.03 and the remaining three weights equal to 1, ρ = 0.005, e = 0.01, and r = 5. The weight of 1.03 was chosen to prevent the true positive rate from decreasing too much. In addition, e > ρ allowed the procedure to proceed to fewer features. The above settings were used to encourage selection of reduced feature subsets. For each setting of t, we executed the procedure 10 times with different k-fold partitions and obtained 10 feature subsets of the same size. Since these 10 resultant feature subsets were slightly different, we performed another 5-fold cross-validation with these feature subsets and compared their performance scores. The one with the highest score was preserved as our feature subset. In addition to the methods of Hwang et al or Gustafson et al,21 we also used the mRMR13 and CMIM14 feature selection methods for comparison. 
Using mRMR as an example, the data used in our feature selection procedure were input into the mRMR program, which produced the ranking score of each feature. The feature with the lowest score was removed first, and a subsequent 5-fold cross-validation with the preserved features was performed to calculate the composite score C(S). That is, in the ith iteration, the features with the lowest i ranking scores were removed and C(S) was calculated. The removal and cross-validation procedure was repeated until no feature was preserved. The entire process (including the random choice of 50% of the data and feature removal) was executed 10 times, and for each size the feature subset with the highest score was recorded. Table 4 shows the selected feature subsets of different sizes for the S. cerevisiae data. The second column of the table lists all selected features. Each column label Ni in the first row represents the feature subset of size i, 5 ≤ i ≤ 18, which was found using our backward feature selection procedure. For each feature subset Ni, a bullet (•) mark in the same column indicates which features were included. The most competent feature subsets selected by CMIM and mRMR are denoted by C32 and m31, which means that 32 and 31 features were selected, respectively. A total of 60 features were selected, which represent the prominent features used to identify essential proteins.
Table 4

Selected features for S. cerevisiae data set.

Feature subsets (columns): N5, N6, N7, N8, N9, N10, N11, N12, N13, N14, N15, N16, N17, N18, m31, C32
ID | Feature | TOT
1 | PR (phyletic retention) | 16
2 | EI (essentiality index) | 16
3 | Cytoplasm | 15
4 | Nucleus | 15
5 | Occurrence of A.A. I | 13
6 | Bit string of DSS | 12
7 | Occurrence of A.A. W | 12
8 | Endoplasmic reticulum | 11
9 | Other process | 7
10 | Occurrence of A.A. S | 7
11 | Occurrence of A.A. G | 6
12 | KLV (clique level) | 6
13 | Cell cycle | 5
14 | Average hydrophobic | 5
15 | Average PSSM of A.A. R | 5
16 | B.C. related to PI | 4
17 | Occurrence of A.A. E | 4
18 | Average PSSM of A.A. P | 4
19 | ID related to T.R. | 3
20 | B.C. T.R. interactions | 3
21 | Other localization | 3
22 | DMNC | 3
23 | Average HYD around C-2 | 3
24 | Signal transduction | 2
25 | Edge percolated component | 2
26 | Occurrence of A.A. P | 2
27 | Occurrence of A.A. T | 2
28 | Occurrence of A.A. Y | 2
29 | Average PSSM of A.A. Q | 2
30 | Average PSSM of A.A. E | 2
31 | CLC (clustering coefficient) | 2
32 | FunK (common function degree) | 2
33 | OD related to T.R. interaction | 1
34 | OD related to M.I. | 1
35 | ID related to M.I. | 1
36 | B.C. related to M.I. | 1
37 | Degree related to PI | 1
38 | Metabolic process | 1
39 | Bottleneck | 1
40 | MNC | 1
41 | Occurrence of A.A. A | 1
42 | Occurrence of A.A. C | 1
43 | Occurrence of A.A. D | 1
44 | Occurrence of A.A. H | 1
45 | Occurrence of A.A. K | 1
46 | Occurrence of A.A. M | 1
47 | Average C position | 1
48 | Protein length | 1
49 | Cysteine count | 1
50 | Cysteine odd-even index | 1
51 | Average HYD around C-1 | 1
52 | Cysteine location-1 | 1
53 | Average PSSM of A.A. A | 1
54 | Average PSSM of A.A. D | 1
55 | Average PSSM of A.A. S | 1
56 | Average PSSM of A.A. W | 1
57 | Average PSSM of A.A. Y | 1
58 | ORFL (ORF length) | 1
59 | CC (closeness centrality) | 1
60 | BC (B.C.) | 1

Abbreviations: DSS, double screening scheme; A.A., amino acid; B.C., betweenness centrality; T.R., transcriptional regulation; HYD, hydrophobicity; PI, physical interaction; A … Y, amino acid abbreviation; M.I., metabolic interaction; OD, outdegree; ID, indegree; m31, mRMR31; C32, CMIM32; FunK, Common function degree; TOT, total.

After the feature subsets were selected, to conduct performance comparison as well as to cope with the randomness, we used Hwang’s method to perform 10 10-fold cross-validations. Here, the true positive rates and false positive rates were input into a different software program to calculate ROC curves and AUC values. In this study, the software package we used was ROCR, which was developed by Tobias et al.30,31 Thus, the reported performance measures, including AUC, F1, MCC, precision, and recall values and ROC curves, were averaged over 10 10-fold cross-validations. For the S. cerevisiae data set, the predictor with Hwang’s 10 features served as a benchmark because it yielded distinguished results in terms of feature size and performance. Additionally, mRMR and CMIM were adopted for comparison. We applied the same procedure to the E. coli data set. The selected features are shown in Table 5, with a total of 43 features selected. In the table, feature subsets selected by CMIM and mRMR are denoted by C9 and m13, meaning that 9 and 13 features were selected, respectively.
Table 5

Selected features for E. coli data set.

Feature subsets (columns): N4, N5, N6, N7, N8, N9, N10, N11, N12, N13, C9, m13
ID | Feature | TOT
1 | PR (phyletic retention) | 12
2 | Open reading frame length | 8
3 | Average PSSM of A.A. C | 6
4 | Degree related to F.I. | 6
5 | Degree related to A.I. | 5
6 | Degree related to PI | 5
7 | Average PSSM of A.A. A | 4
8 | Average PSSM of A.A. R | 4
9 | Average hydrophobic | 4
10 | Bit string of DSS for PI | 4
11 | Paralog count | 4
12 | Occurrence of A.A. M | 3
13 | Occurrence of A.A. W | 3
14 | Occurrence of A.A. E | 2
15 | Occurrence of A.A. F | 2
16 | Occurrence of A.A. G | 2
17 | Occurrence of A.A. I | 2
18 | Average PSSM of A.A. Y | 2
19 | Cysteine location-4 | 2
20 | KLV (clique level) for PI | 2
21 | Degree related to PI and GC | 2
22 | Strand bias | 2
23 | Occurrence of A.A. A | 1
24 | Occurrence of A.A. C | 1
25 | Occurrence of A.A. H | 1
26 | Occurrence of A.A. P | 1
27 | Occurrence of A.A. S | 1
28 | Average PSSM of A.A. N | 1
29 | Average PSSM of A.A. G | 1
30 | Average PSSM of A.A. K | 1
31 | Average PSSM of A.A. F | 1
32 | Average PSSM of A.A. T | 1
33 | Average PSSM of A.A. V | 1
34 | Average distance of every two Cs | 1
35 | Average HYD around C-2 | 1
36 | Cysteine location-1 | 1
37 | Cysteine location-5 | 1
38 | Cysteine odd-even index | 1
39 | Protein length | 1
40 | Bottleneck for PI | 1
41 | CC (closeness centrality) for PI | 1
42 | MNC for PI | 1
43 | B.C. related to all F.I. | 1

Abbreviations: C9, CMIM09; m13, mRMR13; TOT, total; DSS, double screening scheme; F.I., integrated functional interaction; A.I., all interactions. PI, physical interaction; HYD, hydrophobicity; A.A., amino acid; A … Y, amino acid abbreviation.

Bootstrap cross validations

During the bootstrapping stage, for each bootstrap sample, an identical 10-fold partition was employed for all feature subsets to carry out cross-validations and compute various average performance measures. The procedure was repeated for 200 distinct bootstrap samples. In order to perform parametric significance tests, we evaluated whether the distribution of the resultant performance measures was normal and whether the variances obtained from different feature subsets were similar. Consequently, the 200 results of each performance measure for each feature subset were subjected to the Kolmogorov-Smirnov test.31 This test examines the null hypothesis that no systematic difference exists between the standard normal distribution and the underlying distribution against the alternative hypothesis that a systematic difference exists. The threshold was set to 0.05. If the P-value was less than 0.05, we rejected the null hypothesis. For CMIM and mRMR, only the most prominent values are shown. Figures 2 and 3 illustrate the results for the S. cerevisiae and E. coli data sets, respectively, in which the test values were recorded according to the feature subsets, performance measures, and experiment types. For the S. cerevisiae data set, the lowest P-value, 0.186, was observed for the AUC of the N8-imbalanced experiment. Therefore, it is likely that there was no significant difference between the normal distribution and the distribution of every performance measure of each feature subset. For the E. coli data set, most performance measures were normal, with the exception of the recall values associated with N12-balanced (P-value = 0.035), for which the comparison results were not reliable. For the performance measures associated with each model, we also listed their confidence intervals and information odds ratios,32 which are shown in the Appendix.
Figure 2

The P-value of the normality test in S. cerevisiae data set.

Figure 3

The P-value of the normality test in E. coli data set.
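
A normality check of the kind reported in Figures 2 and 3 can be sketched with the standard library alone. The sketch below rests on assumptions that the text does not spell out: the 200 values are standardized with their sample mean and standard deviation before being compared with the standard normal distribution, and the P-value uses the asymptotic Kolmogorov series with a common small-sample correction. Standardizing with estimated parameters tends to make the P-value conservative.

```python
import math
import random

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic_vs_normal(values):
    """Largest gap between the empirical CDF of the standardized values
    and the standard normal CDF."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = sorted((v - mean) / sd for v in values)
    d = 0.0
    for i, zi in enumerate(z):
        cdf = normal_cdf(zi)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def ks_p_value(d, n, terms=100):
    """Asymptotic P-value; the correction factor is an assumed approximation."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.1:  # the series is numerically unreliable here; P is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Demo on synthetic values (not data from the study).
rng = random.Random(0)
demo = [rng.gauss(0.5, 0.02) for _ in range(200)]
d_demo = ks_statistic_vs_normal(demo)
p_demo = ks_p_value(d_demo, len(demo))
print(f"D = {d_demo:.3f}, P = {p_demo:.3f}")
```

With the 0.05 threshold used in the text, normality would be rejected only when the returned P-value falls below 0.05.
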

For a given performance measure, since the variances obtained by the various feature subsets were quite similar, we used an analysis of variance33 (ANOVA) test to examine whether differences existed among the performance measures of different feature subsets. Here, one variance is obtained from the multiple experiments with one feature subset. According to the ANOVA, differences existed. Next, all of these measures were compared with their associated benchmarks to calculate performance deviations. For each type of performance measure, the 95% confidence interval of the average deviation was computed, and significance was determined by whether this interval covered 0.
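
The two steps above can be illustrated with a short sketch. It assumes that deviations are paired by bootstrap sample and that the 95% interval uses the normal approximation (z = 1.96); neither detail is stated in the text, and the function names are hypothetical. Converting the F statistic to a P-value requires the F distribution, which is outside the standard library and is therefore omitted here.

```python
import math
from statistics import mean, stdev

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; each group holds the results of
    one feature subset for a single performance measure."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def deviation_ci(candidate, benchmark, z=1.96):
    """Approximate 95% confidence interval of the mean paired deviation."""
    diffs = [c - b for c, b in zip(candidate, benchmark)]
    m = mean(diffs)
    half = z * stdev(diffs) / math.sqrt(len(diffs))
    return m - half, m + half

def significance_symbol(ci):
    """'(+)' or '(-)' when the interval excludes 0, otherwise ''."""
    low, high = ci
    if low > 0:
        return "(+)"
    if high < 0:
        return "(-)"
    return ""
```

An interval that covers 0 yields no symbol, which corresponds to a table entry with no significant difference from the benchmark.
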

Performance comparison and significance tests

In this section, we compare our experimental results with those of other feature selection methods and previous studies. For conciseness, we only show the most prominent results associated with mRMR and CMIM. We observed that the feature subsets identified by these two methods were relatively large. Comparisons with mRMR and CMIM at smaller feature subset sizes, together with the working principles of these two methods, are detailed in the Appendix.

S. cerevisiae

Table 6 lists the average values of five performance measures associated with a variety of feature subsets, which were obtained by 10 runs of 10-fold cross-validation for imbalanced data. We adjusted the SVM cost parameters in order to achieve similar levels of precision. The first four rows show the results of CMIM32 (32 features), mRMR31 (31 features), Hwang’s (10 features), and Acencio’s (23 features); the values enclosed in parentheses after these names are the numbers of features. Results produced by our method are listed in the subsequent rows of the table. Significance tests were carried out with the bootstrap cross-validations over 200 bootstrap samples. The first three symbols following each numerical value, each of which can be plus (+) or minus (−), indicate that the value is significantly higher or lower than the corresponding benchmark result. Values that serve as benchmarks are marked with a star (*) for clarity. For example, the recall of N6 was significantly higher than that of Hwang, while its AUC was significantly lower than those of mRMR31 and Hwang. For the feature subsets with the prefix ‘N’, the fourth symbol after a numerical value indicates the significance between two neighboring rows. For example, for N7, its AUC was significantly higher than that of N6 and its recall value was also significantly higher than that of N8. For values of the same performance measure in each column, the best is underlined. Values in the last row show the results with the full set of 90 features.
Table 6

Performance comparison for the imbalanced S. cerevisiae data set.

Feature subset | AUC | Precision | Recall | F-measure | MCC
CMIM32 | 0.825*(+)(+) | 0.744*(+) | 0.369*(+) | 0.493*(+) | 0.450*(+)
mRMR31 | 0.821(−)*(+) | 0.738* | 0.372*(+) | 0.495*(+) | 0.449*(+)
Hwang(10) | 0.775(−)(−)* | 0.743(−)* | 0.343(−)(−)* | 0.469(−)(−)* | 0.432(−)(−)*
Acencio(23) | 0.707(−)(−)(−) | 0.675(−)(−)(−) | 0.121(−)(−)(−) | 0.204(−)(−)(−) | 0.228(−)(−)(−)
N4 | 0.744(−)(−)(−) | 0.782 | 0.327(−)(−)(−) | 0.461(−)(−)(−) | 0.439(−)(−)
N5 | 0.727(−)(−)(−) | 0.741(−)(−) | 0.387(−)(−)(+) | 0.509(−)(−)(+) | 0.461(−)(−)(+)
N6 | 0.730(−)(−)(−) | 0.752(−) | 0.395(−)(−)(+) | 0.518(−)(−) | 0.472(−)(−)
N7 | 0.761(−)(−)(+) | 0.767(−) | 0.386(−)(−)(+) | 0.513(−)(−)(+) | 0.473(−)(−)
N8 | 0.772(−)(−) | 0.755 | 0.371(−)(−)(−) | 0.498(−)(−)(+)(−) | 0.457(−)(−)(−)
N9 | 0.782(−)(−)(+) | 0.749 | 0.382(−)(+)(+) | 0.506(−)(+) | 0.462(−)(+)
N10 | 0.781(−)(−)(+) | 0.751 | 0.399(+) | 0.521(+) | 0.474(+)
N11 | 0.786(−)(−)(+) | 0.752 | 0.402(+) | 0.524(+) | 0.476(+)
N12 | 0.798(−)(−)(+) | 0.759 | 0.409(+) | 0.532(+) | 0.485(+)
N13 | 0.789(−)(−)(+) | 0.748 | 0.433(+)(+) | 0.549(+) | 0.495(+)
N14 | 0.802(−)(+) | 0.749 | 0.397(+)(−) | 0.519(+)(−) | 0.471(+)(−)
N15 | 0.801(−)(+) | 0.763 | 0.406(+) | 0.530(+) | 0.485(+)(+)
N16 | 0.814(−)(+)(+) | 0.762 | 0.401(+) | 0.525(+) | 0.480(+)
N17 | 0.814(−)(+) | 0.761 | 0.407(+) | 0.530(+) | 0.484(+)
N18 | 0.811(−)(+) | 0.751 | 0.411(+) | 0.531(+) | 0.482(+)
N90 | 0.829(+)(+)(+)(+) | 0.738(+)(+) | 0.355(+)(+)(−) | 0.479(+)(+)(+)(−) | 0.438(+)(+)(+)(−)

Note: With the polynomial kernel function, the values of precision, recall and MCC are reported as 0.77, 0.23, and 0.36, respectively, in the original paper of Hwang et al.7
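
For reference, four of the five measures in Table 6 follow directly from the confusion counts of a classifier. The sketch below uses the standard textbook definitions; the function name and the convention of returning 0 for an undefined ratio are illustrative choices, not taken from the paper.

```python
import math

def classification_measures(tp, fp, tn, fn):
    """Precision, recall, F-measure and Matthews correlation coefficient
    from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}
```

As a consistency check against Table 6, the CMIM32 precision of 0.744 and recall of 0.369 give 2 × 0.744 × 0.369 / (0.744 + 0.369) ≈ 0.493, which agrees with the tabulated F-measure.
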

Based on Table 6, CMIM32, mRMR31 and Hwang’s predictors outperformed Acencio’s in all performance measures. For our feature subsets, the performance measures were slightly higher than Hwang’s. For N8, there was no difference from Hwang’s in AUC, while the remaining measures were higher than Hwang’s. When the feature size exceeded 8, the improvement over Hwang’s was significant in most cases, except for precision. In comparison with mRMR, our method performed nearly as well as mRMR31 when the feature size was between 9 and 13, with the exception of AUC. When the feature size ranged from 14 to 18, there was no performance difference between our model and mRMR31. The most prominent predictor was CMIM32. Except for AUC, our results achieved similar levels of performance when the feature size exceeded 14. Note that CMIM32 and mRMR31 use 32 and 31 features, respectively, which is far more than our subsets use.

Table 7 shows the average performance measures in the balanced experiments on the S. cerevisiae data set, which were also obtained via 10 runs of 10-fold cross-validation. For our feature subsets with sizes ranging from 5 to 18, nearly all performance measures were the same as or higher than Hwang’s. This shows that the feature subsets with sizes exceeding 5 are at least as good as Hwang’s. Additionally, those with 12 or more features achieved significant improvement. Compared with CMIM32 and mRMR31, our results showed similar levels of performance when the feature size exceeded 15. The results with the full set of 90 features are shown in the last row; their performance measures were similar to those of N14 to N18.
Table 7

Performance comparison for the balanced S. cerevisiae data set.

Feature subset | AUC | Precision | Recall | F-measure | MCC
CMIM32 | 0.842*(+) | 0.772* | 0.766*(+) | 0.769*(+) | 0.540*(+)
mRMR31 | 0.836*(+) | 0.765* | 0.741*(+) | 0.752*(+) | 0.513*(+)
Hwang(10) | 0.822(−)(−)* | 0.778* | 0.720(−)(−)* | 0.748(−)(−)* | 0.516(−)(−)*
Acencio(23) | 0.768(−)(−)(−) | 0.696(−)(−)(−) | 0.734(−)(−) | 0.714(−)(−)(−) | 0.414(−)(−)(−)
N4 | 0.811(−)(−)(−)(+) | 0.777(−)(+) | 0.716(−)(−) | 0.745(−)(−)(+) | 0.512(−)(−)(+)
N5 | 0.824(−)(−)(+) | 0.778(−) | 0.735(−)(−)(+) | 0.756(−)(−) | 0.527(−)(−)
N6 | 0.827(−)(−)(+) | 0.778(−) | 0.739(−)(−) | 0.758(−)(−) | 0.530(−)(−)
N7 | 0.831(−)(−) | 0.779 | 0.733(−)(−) | 0.755(−)(−) | 0.526(−)(−)
N8 | 0.826(−)(−) | 0.786 | 0.721(−)(−) | 0.752(−)(−) | 0.527(−)(−)
N9 | 0.833(−)(−)(+) | 0.791 | 0.735(−)(−)(+) | 0.762(−)(−) | 0.541(−)(−)
N10 | 0.834(−)(−) | 0.789 | 0.736(−)(−) | 0.761(−)(−) | 0.540(−)(−)
N11 | 0.831(−)(−) | 0.784 | 0.737(−)(−) | 0.760(−)(−) | 0.535(−)(−)
N12 | 0.829(−)(−) | 0.779 | 0.732 | 0.755(+) | 0.526(+)
N13 | 0.834(−)(−) | 0.788 | 0.730(−)(−) | 0.758(−)(−) | 0.535(−)(−)
N14 | 0.836(−)(−)(+) | 0.777 | 0.743(−)(−)(+) | 0.759(−)(−)(+) | 0.530(−)(−)
N15 | 0.843(−)(−)(+) | 0.784 | 0.748(−)(−)(+) | 0.766(−)(+) | 0.542(−)(+)
N16 | 0.842(+) | 0.777 | 0.756(+) | 0.767(+) | 0.540(+)
N17 | 0.847(+) | 0.778 | 0.763(+) | 0.770(+) | 0.545(+)
N18 | 0.840(−)(−)(+)(−) | 0.779(−) | 0.740(−)(−)(+)(−) | 0.759(−)(−) | 0.531(−)(−)
N90 | 0.839(+)(+) | 0.760 | 0.753(+) | 0.757(+) | 0.516(+)

Note: In the original paper of Hwang et al7 the values of precision, recall, F-measure and MCC are reported as 0.763, 0.713, 0.737, and 0.492, respectively, with the polynomial kernel function.

In Table 6, we can observe that feature subsets N5, N7, N9, N13, N15, and N16 showed significant improvement in performance over neighboring rows while remaining small in feature size. In Table 7, the significant subsets were N5, N6, and N9. In addition, as shown in Tables 6 and 7, our models performed as well as CMIM32 and mRMR31 when the feature size was 16 or 17. We used N5, N9, and N16 to draw ROC curves.

E. coli

Tables 8 and 9 show the average values of five performance measures associated with a variety of feature subsets, which were obtained by 10 runs of 10-fold cross-validation for the imbalanced and balanced experiments, respectively. The first three rows show the results of CMIM09 (9 features), mRMR13 (13 features), and Gustafson’s (29 features).
Table 8

Performance comparison for imbalanced E. coli data set.

Feature subset | AUC | Precision | Recall | F1 | MCC
CMIM09 | 0.701*(−) | 0.720* | 0.271*(−) | 0.394*(−) | 0.382*(−)
mRMR13 | 0.715*(−) | 0.713* | 0.250*(−) | 0.370*(−) | 0.360*(−)
Gustafson(29) | 0.711(+)(+)* | 0.720* | 0.290(+)(+)* | 0.420(+)(+)* | 0.413(+)(+)*
N4 | 0.691(−)(−)(−) | 0.725 | 0.280(+)(−)(−) | 0.404(+)(−) | 0.391(−)
N5 | 0.690(−)(−)(−) | 0.737 | 0.295(+)(+)(+) | 0.421(+)(+)(−)(+) | 0.407(+)(−)(+)
N6 | 0.701(−) | 0.742 | 0.287(+)(+) | 0.414(+)(+) | 0.403(+)(+)
N7 | 0.714(−) | 0.735 | 0.275(+)(+)(−) | 0.400(+)(+)(−) | 0.392(+)
N8 | 0.705(−) | 0.742 | 0.288(+)(+)(−)(+) | 0.415(+)(+)(−)(+) | 0.405(+)(−)
N9 | 0.707(−) | 0.726(−) | 0.293(+)(+) | 0.417(+)(+) | 0.401(+)(+)
N10 | 0.711(−) | 0.724 | 0.294(+)(+) | 0.418(+)(+) | 0.401(+)(+)
N11 | 0.714(−) | 0.732 | 0.278(+)(+)(−) | 0.403(+)(+) | 0.393(+)(+)
N12 | 0.712(+) | 0.725 | 0.292(+)(+) | 0.416(+)(+) | 0.400(+)(+)
N13 | 0.714(+) | 0.733 | 0.287(+)(+) | 0.413(+)(+) | 0.400(+)(+)
N80 | 0.716(+)(+)(+) | 0.677(+)(+)(+)(−) | 0.237(+)(+)(−) | 0.352(+)(+)(−) | 0.339(+)(+)(+)(−)

Table 9

Performance comparison for balanced E. coli data set.

Feature subset | AUC | Precision | Recall | F1 | MCC
CMIM09 | 0.767*(+) | 0.720* | 0.700*(+) | 0.710*(+) | 0.421*
mRMR13 | 0.762(−)*(−) | 0.728* | 0.654(−)*(−) | 0.689(−)*(−) | 0.396*(−)
Gustafson(29) | 0.777(+)* | 0.722* | 0.715(+)* | 0.719(+)* | 0.440(+)*
N4 | 0.780(+) | 0.733 | 0.701(+) | 0.717(+) | 0.446(+)
N5 | 0.779(+) | 0.730 | 0.706(+) | 0.718(+) | 0.445(+)
N6 | 0.762(−)(−) | 0.735 | 0.663(−)(−) | 0.696 | 0.425
N7 | 0.783(+)(+) | 0.737 | 0.696(+) | 0.716(+)(+) | 0.448(+)
N8 | 0.781(+) | 0.723 | 0.711(+) | 0.717(+) | 0.439(+)
N9 | 0.782(+) | 0.715 | 0.703(+) | 0.709(+) | 0.423
N10 | 0.781(+) | 0.725 | 0.702(+) | 0.713(+) | 0.436(+)
N11 | 0.777(+) | 0.719 | 0.700(+) | 0.709(+) | 0.426
N12 | 0.776(+) | 0.715 | 0.695(+) | 0.705(+) | 0.418
N13 | 0.776(+) | 0.731 | 0.695(+) | 0.712(+) | 0.439(+)
N80 | 0.769 | 0.711 | 0.715(+)(+) | 0.713(+) | 0.424

Table 8 shows that Gustafson’s predictor outperformed CMIM09 in most performance measures in the imbalanced experiments. For our feature subsets, the performance measures were slightly higher than those of CMIM09. When the feature size exceeded 6, the improvement over CMIM09 was consistently significant. Compared with Gustafson’s method, ours performed almost as well when the feature size was over 11. Note that Gustafson’s method uses 29 features, which is more than any of our subsets. Table 9 shows almost no performance difference among most feature subsets in the balanced experiments, except for the least effective predictor, mRMR13. For further ROC analysis, in addition to CMIM09, mRMR13, and Gustafson’s, we used N4, N8, N11 and N80 to draw ROC curves, so as to observe the performance of insufficient, medium-sized and full feature sets.

ROC analysis

Figure 4 illustrates the average ROC curves and AUCs of various feature subsets for the imbalanced data experiments. Apart from the most competent predictor CMIM32, although the AUC of N5 is higher than that of Acencio’s, the two curves intersect at 0.5 on the horizontal axis. This indicates that N5 was a better predictor when the allowed maximal false positive rate was below 0.5. In contrast, when the allowed false positive rate exceeded 0.5, Acencio’s was better than N5. Comparing N9 and Hwang’s method, both AUC values were similar. For the feature subsets with sizes exceeding 8 (not all shown in this figure), all true positive rates were either higher than or at least close to Hwang’s. This was also supported by the significance tests in Table 6 and suggests that the feature subsets with sizes exceeding 8 achieved higher performance in AUC than Hwang’s predictor.
Figure 4

The average ROC curves and AUCs for the imbalanced S. cerevisiae data set.
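
The ROC curves and AUC values discussed here can be computed from classifier scores as in the following sketch. It is a generic construction (threshold sweep plus trapezoidal area), not the authors' plotting code; the function names are illustrative, labels are assumed to be 1 for essential and 0 for non-essential, and both classes must be present.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate),
    sweeping the decision threshold from high to low scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    pos = sum(1 for _, y in pairs if y == 1)
    neg = len(pairs) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Items with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Two predictors with similar AUC can still cross, as N5 and Acencio’s do in Figure 4, so comparing the true positive rate at the false positive rate of interest is more informative than the area alone.
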

Figure 5 illustrates the average ROC curves and AUCs of various feature subsets for the balanced data experiments. CMIM32 again was the most competent predictor. Additionally, N16 achieved the same level of AUC. For the feature subsets of sizes ranging from 5 to 18 (not all shown), their true positive rates were either higher than or at least close to Hwang’s level. Thus, N5, N6, …, N18 outperformed Hwang’s predictor or performed equally well for various combinations of true and false positive rates in the balanced experiments. As with the imbalanced data set, the more features, the higher the AUC values. However, the improvement in AUC from adding features was not as significant as in the imbalanced experiments. It should be noted that both the ROC curve and the AUC of Acencio’s predictor were reproduced by our experiments and thus they were slightly different from the original values reported by Acencio and Lemke.5
Figure 5

The average ROC curves and AUCs for the balanced S. cerevisiae data set.

For the imbalanced E. coli data set, Figure 6 illustrates the average ROC curves and AUCs of various feature subsets. It shows that all curves were similar where the false positive rate was below 10%. This indicates that there was little difference when the allowable false positive rate was less than 10%. For false positive rates above 10%, N80 was the highest performer, Gustafson and N11 were secondary, and N4 was the worst. In contrast, in Figure 7, which corresponds to the balanced data set, N4 and N8 were the best performers. The remaining predictors showed few differences.
Figure 6

The average ROC curves and AUCs for the imbalanced E. coli data set.

Figure 7

The average ROC curves and AUCs for the balanced E. coli data set.

Top percentage analysis

Table 10 shows the average top percentage information for the imbalanced S. cerevisiae data set. The top θ probability is defined as the fraction of true essential proteins among the top-ranked θ × 975 proteins, where 975 is the total number of true essential proteins. The top θ probability shows the likelihood that the proteins are essential if the user decides to choose a specific number of top-ranked candidates. It is slightly different from precision because the top-ranked candidates (the denominator) are not necessarily classified as essential. CMIM32, mRMR31 and Hwang’s results again served as benchmarks and are denoted by star (*) symbols in the table. A minus symbol following a value indicates that the value was lower than a benchmark result.
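
The top θ probability defined above can be written as a short function. This is a sketch under stated assumptions: the ranking uses the classifier's decision scores, the cutoff θ × (number of true essential proteins) is rounded to the nearest integer, and the function and parameter names are hypothetical.

```python
def top_theta_probability(scores, labels, theta, n_essential=None):
    """Fraction of true essential proteins (label 1) among the
    round(theta * n_essential) highest-scoring proteins."""
    if n_essential is None:
        n_essential = sum(labels)  # 975 for the S. cerevisiae data in the text
    cutoff = round(theta * n_essential)
    if cutoff <= 0:
        raise ValueError("theta is too small to select any protein")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top = ranked[:cutoff]
    return sum(label for _, label in top) / len(top)
```

With θ = 1 the cutoff equals the number of true essential proteins, so the value is simultaneously the precision and the recall of the top-ranked list at that cutoff.
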
Table 10

Percentage of essential proteins in the imbalanced S. cerevisiae data.

Feature subset | Top 5% | Top 10% | Top 15% | Top 20% | Top 25% | Top 30% | Top 50% | Top 75% | Top 100%
CMIM32 | 0.939* | 0.918* | 0.910* | 0.892* | 0.870* | 0.839* | 0.743* | 0.645* | 0.582*
mRMR31 | 0.955* | 0.905–* | 0.884–* | 0.862–* | 0.834–* | 0.820–* | 0.740–* | 0.641–* | 0.572–*
Hwang(10) | 0.959* | 0.918* | 0.871––* | 0.853––* | 0.843–* | 0.816––* | 0.720––* | 0.637––* | 0.563––*
Acencio(23) | 0.800– | 0.741– | 0.693– | 0.661– | 0.646– | 0.625– | 0.578– | 0.519– | 0.457–
N4 | 0.980 | 0.930 | 0.905– | 0.877– | 0.865– | 0.850 | 0.727– | 0.632– | 0.559–
N5 | 0.843– | 0.861– | 0.859– | 0.852– | 0.841– | 0.827– | 0.751 | 0.641– | 0.530–
N6 | 0.908– | 0.894– | 0.875– | 0.857– | 0.850– | 0.834– | 0.763 | 0.635– | 0.526–
N7 | 0.861– | 0.892– | 0.897– | 0.885– | 0.854– | 0.832– | 0.770 | 0.645– | 0.570–
N8 | 0.892– | 0.904– | 0.895– | 0.877– | 0.868– | 0.850 | 0.751 | 0.657 | 0.574–
N9 | 0.880– | 0.911– | 0.895– | 0.875– | 0.860– | 0.832– | 0.753 | 0.665 | 0.585
N10 | 0.882– | 0.896– | 0.893– | 0.882– | 0.858– | 0.846 | 0.762 | 0.665 | 0.581–
N11 | 0.900– | 0.900– | 0.888– | 0.875– | 0.861– | 0.856 | 0.769 | 0.667 | 0.580–
N12 | 0.941 | 0.924 | 0.899– | 0.872– | 0.866– | 0.854 | 0.776 | 0.664 | 0.588
N13 | 0.941 | 0.910– | 0.886– | 0.870– | 0.853– | 0.840 | 0.781 | 0.672 | 0.578–
N14 | 0.949 | 0.932 | 0.916 | 0.897 | 0.867– | 0.845 | 0.759 | 0.667 | 0.587
N15 | 0.906– | 0.894– | 0.897– | 0.884– | 0.866– | 0.851 | 0.776 | 0.672 | 0.584
N16 | 0.933– | 0.901– | 0.895– | 0.886– | 0.864– | 0.851 | 0.771 | 0.677 | 0.599
N17 | 0.943 | 0.903– | 0.879– | 0.871– | 0.866– | 0.856 | 0.777 | 0.679 | 0.595
N18 | 0.937– | 0.892– | 0.880– | 0.870– | 0.864– | 0.854 | 0.778 | 0.683 | 0.595
N90 | 0.939 | 0.911– | 0.884– | 0.869– | 0.856– | 0.835– | 0.728– | 0.639– | 0.572–

Both mRMR31 and Hwang’s predictor were extremely effective within the 10% range. This indicates that these predictors were quite preferable when the total number of true essential proteins was known and the allowable top-ranked candidates were within 10%. Most of our predictors outperformed them beyond 10%. For CMIM32, our predictors outperformed it beyond 30%. Thus, N14 may be a better choice because it is relatively effective beyond 10%. Figure 8 depicts the average top percentage curves.
Figure 8

The average top percentage curves for the imbalanced S. cerevisiae data set.

Table 11 shows the average top percentage information for the imbalanced E. coli data set. CMIM09, mRMR13, and Gustafson’s results serve as benchmarks. The CMIM09 predictor was the most effective over the entire range. Most of our predictors outperformed these predictors beyond 15%. N9 was the most prominent since it was relatively effective over the entire range. Figure 9 depicts the average top percentage curves.
Table 11

Percentage of essential proteins in the imbalanced E. coli experiment.

Feature subset | Top 5% | Top 10% | Top 15% | Top 20% | Top 25% | Top 30% | Top 50% | Top 75% | Top 100%
CMIM09 | 0.745* | 0.775* | 0.730* | 0.730* | 0.737* | 0.727* | 0.644* | 0.542* | 0.440*
mRMR13 | 0.719–* | 0.725–* | 0.714–* | 0.701–* | 0.690–* | 0.679–* | 0.614–* | 0.531–* | 0.446*
Gustafson(29) | 0.706––* | 0.692––* | 0.705––* | 0.707–* | 0.695–* | 0.685–* | 0.624–* | 0.522––* | 0.436–*
N4 | 0.610– | 0.689– | 0.765 | 0.760 | 0.749 | 0.752 | 0.649 | 0.534– | 0.449
N5 | 0.655– | 0.705– | 0.747 | 0.747 | 0.745 | 0.743 | 0.653 | 0.535– | 0.443
N6 | 0.719– | 0.723– | 0.717– | 0.736 | 0.744 | 0.748 | 0.658 | 0.525– | 0.435–
N7 | 0.568– | 0.700– | 0.713– | 0.730 | 0.748 | 0.762 | 0.655 | 0.540– | 0.464
N8 | 0.671– | 0.703– | 0.723– | 0.736 | 0.728– | 0.731 | 0.652 | 0.535– | 0.449
N9 | 0.813 | 0.785 | 0.766 | 0.754 | 0.734– | 0.726– | 0.685 | 0.548 | 0.459
N10 | 0.794 | 0.751– | 0.728– | 0.732 | 0.741 | 0.745 | 0.668 | 0.550 | 0.463
N11 | 0.719– | 0.721– | 0.734 | 0.742 | 0.748 | 0.748 | 0.668 | 0.539– | 0.457
N12 | 0.735– | 0.738– | 0.745 | 0.750 | 0.743 | 0.740 | 0.667 | 0.548 | 0.458
N13 | 0.655– | 0.672– | 0.713– | 0.739 | 0.736– | 0.732 | 0.668 | 0.548 | 0.457
N80 | 0.674– | 0.690– | 0.703– | 0.705– | 0.703– | 0.691– | 0.632– | 0.529– | 0.452

Figure 9

The average top percentage curves for the imbalanced E. coli data set.

Discussion

By inspecting the S. cerevisiae feature subsets listed in Table 4, we observed that the most prominent features indeed come from diverse sources, including sequence, protein, topology, and other properties. Among these features, amino acid occurrence I, amino acid occurrence W, bit string of double screening scheme, cytoplasm, endoplasmic reticulum, EI (essentiality index), nucleus, and PR (phyletic retention) were selected more than 10 times. Two of these features, EI and PR, were included in all feature subsets and are thus regarded as the most important factors for identifying essential proteins. N9 and N8 were the feature subsets that covered most of the above 8 features. The predictors associated with these two feature subsets outperformed Hwang’s results in all performance measures, except for AUC and the top percentage probability at very low percentages. Furthermore, two amino acid features, which are relatively easy to extract, were included in these two feature subsets. Predictors built with feature subsets of 10 or more features were consistently superior to Hwang’s in nearly all performance measures. Interestingly, mRMR and CMIM selected several sequence-derived features, such as PSSM and amino acid occurrence. It thus seems that these features are good for essentiality prediction in terms of relevance and feature independence. By analyzing Tables 6 and 7, we recommend using N16, N9, and N6. N16 performed nearly as well as CMIM32 (or mRMR31) and was more compact in feature size. N9 and N6, each obtained with one additional feature, were significantly higher than N8 and N5, respectively, in some performance measures.

For the E. coli features listed in Table 5, the most and second most important features were PR (phyletic retention) and open reading frame length. The remaining important features, which were selected more than five times, included the average PSSM of amino acid C and the degree related to integrated functional interactions. N5, N7, N8, and N13 covered most of these features. Among the 43 listed features, 21 sequence-related features, such as amino acid occurrence and average PSSM, were selected. In this data set, we recommend feature subsets of sizes exceeding 11 because of their effectiveness and compactness in feature size.

With the experimental results for the two data sets, we conclude that phyletic retention is the most important feature for identifying essential proteins. It is defined as the number of organisms in which an ortholog is present. The study of Gustafson et al21 analyzed different organisms to calculate phyletic retention for the E. coli and S. cerevisiae data sets. This is sensible because different species may be associated with different organisms. From the biological view, the retention process over long evolutionary periods suggests that some organisms are crucial for certain cell functionality. By inspecting the top 5 occurrences of amino acids in Tables 4 and 5, we find that tryptophan (W) and glycine (G) were the two top-ranking features. Since both amino acids are non-polar and hydrophobic, we may hypothesize either that essentiality is related to these physicochemical properties or that these features carry discriminative information that is not captured by other top-ranking features.

In this study, we compiled various interaction information including physical, metabolic, transcriptional regulation, and integrated functional and genomic context interactions. The experimental results revealed that various properties, such as degrees, were more or less identified as important features. This implies that the interaction information, not limited to physical interactions, may also be closely related to essentiality. According to the literature, hubs of the networks, which possess an abundance of interaction partners, are important because they play central roles in mediating interactions among numerous less-connected proteins. Thus, proteins involved in the complex mediation processes are more likely to be crucial for cellular activity or survival.

For the feature selection proposed in this study, let the numbers of all available features and of target selected features be m and t, respectively, and let the maximal number of retries be r. The number of SVM cross-validations is between 1/2 (m + t + 1) × (m − t) and 1/2 (m + t + 1) × (m − t) × r. It takes approximately 1 minute for the LIBSVM software to perform a 2-fold cross-validation on one Power5+ processor of an IBM P595 computer. Assuming m = 90, t = 10 and r = 5, the total running time is between approximately 4,000 and 20,000 minutes. The IBM P595 allows users to manually submit several processes in order to speed up the execution. For example, we can invoke at most 10 SVM processes simultaneously. Consequently, a speed-up of at most 10 times can be achieved, and the total running time is reduced accordingly. If we inspect Tables 4 and 5, we find that more than one-third of the features were not significantly relevant and thus were not selected. These features are relatively easy to remove at the beginning stage of the backward feature selection procedure. According to the authors’ experience, the number of retries r is not critical in this stage. As more features are removed, the required number of retries must be increased because identifying relatively less competent features becomes increasingly difficult. The number of retries r, together with the other user-specified parameters (such as the minimal improvement ρ), was set appropriately to ensure that the feature selection procedure could proceed.
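
The bounds on the number of cross-validations can be checked with a few lines of arithmetic. The sketch assumes one plausible reading of the formula, namely that a removal round starting from k features evaluates k candidate removals, so that one pass from m down to t features costs the sum of k for k = t + 1, …, m; the function name is illustrative.

```python
def cv_count_bounds(m, t, r):
    """Lower and upper bounds on the number of SVM cross-validations
    when reducing m features to t with at most r retries per round."""
    # Assumed reading: a round with k features tries k candidate removals.
    per_pass = sum(range(t + 1, m + 1))
    closed_form = (m + t + 1) * (m - t) // 2
    assert per_pass == closed_form  # the sum equals the closed form
    return closed_form, closed_form * r

low, high = cv_count_bounds(90, 10, 5)
print(low, high)  # 4040 20200
```

At about 1 minute per cross-validation this gives 4,040 to 20,200 minutes, matching the rounded figures of 4,000 and 20,000 in the text; with 10 simultaneous processes the best case would be roughly a tenth of that.
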

Conclusion and Future Work

In this study, we incorporated several protein properties, including sequence, protein, topology, and other properties. There was a total of 55 groups and 96 features. The features were included in two data sets for experiments: S. cerevisiae and E. coli. We used a modified sequential backward feature selection to identify good feature subsets and used the SVM software tools for classifying essential proteins. In addition, we built several SVM models for both imbalanced and balanced data sets. As our experimental results illustrate, some features were indeed shown to be effective for essentiality prediction. Feature subsets selected by our method were effective in terms of feature size and performance. This is because our method took both feature size and performance into consideration, and consequently the resultant feature subsets were considerably compact. We compared our experimental results by carrying out significance tests for several types of performance measures. This provides researchers of essential proteins with a practical guide to which features or methods are more prominent. In the imbalanced S. cerevisiae data experiment, our best results for F-measure and MCC were 0.549 and 0.495, respectively, which were associated with the N13 predictor. In contrast, for the same performance measures, we achieved 0.770 and 0.545 in the balanced data experiment, which were associated with the N17 predictor. The experimental results showed that the performance of our models was better than Hwang’s when we selected more than 9 features. If achieving higher accuracy is the main issue, we recommend the N16 model (16 features). When one prefers a compact feature set of small size, we suggest using the N9 model (9 features). We also list important features, which may be crucial for identifying essential proteins. For the E. coli data set, our best values of F-measure and MCC were 0.421 and 0.407 in the imbalanced experiments. In the balanced experiment, the best values of F-measure and MCC were 0.718 and 0.448, respectively. Both of the best results were associated with the N5 predictor. For this data set, we found that predictors associated with feature sizes above 11 were indeed comparable to Gustafson’s.

There are several possible methods for further improving the prediction capability. Features related to the protein sequence properties may also be useful for identifying essentiality. Furthermore, since proteins with similar primary structures may possess similar functions, essentiality may be addressed from the sequence motif perspective.34 In addition to the above approaches, performance can be improved by incorporating other tools or constructing hybrid predictors. Among these, the majority vote35 is a strategy for combining classifiers. This method represents the simplest method for categorical data fusion. According to the literature,36 the prerequisite for improvement is that each individual classifier must contain distinct information for discrimination. Otherwise, some negative effects may be imposed on the constructed ensemble.
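
As an illustration of the majority vote mentioned above, the following minimal sketch combines the class labels produced by several classifiers for the same proteins. It is a generic example rather than a method evaluated in this study; the function name and the tie-breaking rule (the smallest label wins) are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine predictions by majority vote.

    predictions: one list of class labels per classifier, all of the
    same length and in the same protein order.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes).most_common()
        best = counts[0][1]
        # Tie-break (assumption): choose the smallest label among the winners.
        winners = sorted(label for label, c in counts if c == best)
        combined.append(winners[0])
    return combined
```

Using an odd number of classifiers avoids ties for a two-class problem such as essential versus non-essential.
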
Table 12

Confidence intervals of performance measures (×100) and informational odds ratios for models produced by the imbalanced S. cerevisiae data set.

Feature subset | AUC | Precision | Recall | F1 | MCC | IOR
CMIM32 | 82.5 ± 1.2 | 74.4 ± 3.1 | 36.9 ± 4.5 | 49.3 ± 3.8 | 45.0 ± 3.6 | 5.2 ± 0.5
mRMR31 | 82.1 ± 1.6 | 73.8 ± 3.2 | 37.2 ± 4.3 | 49.5 ± 3.6 | 44.9 ± 3.4 | 5.2 ± 0.5
Hwang | 77.5 ± 2.2 | 74.3 ± 3.7 | 34.3 ± 4.3 | 46.9 ± 4.0 | 43.2 ± 3.6 | 5.1 ± 0.4
Acencio | 70.7 ± 3.4 | 67.5 ± 6.3 | 12.1 ± 5.5 | 20.4 ± 7.6 | 22.8 ± 6.0 | 3.7 ± 0.4
N4 | 74.4 ± 2.7 | 78.2 ± 3.7 | 32.7 ± 4.1 | 46.1 ± 4.1 | 43.9 ± 3.5 | 5.3 ± 0.4
N5 | 72.7 ± 3.6 | 74.1 ± 4.1 | 38.7 ± 4.7 | 50.9 ± 4.1 | 46.1 ± 3.8 | 5.3 ± 0.5
N6 | 73.0 ± 3.2 | 75.2 ± 4.2 | 39.5 ± 4.4 | 51.8 ± 3.8 | 47.2 ± 3.6 | 5.5 ± 0.5
N7 | 76.1 ± 2.4 | 76.7 ± 3.7 | 38.6 ± 4.4 | 51.3 ± 3.9 | 47.3 ± 3.6 | 5.5 ± 0.5
N8 | 77.2 ± 2.4 | 75.5 ± 3.4 | 37.1 ± 4.9 | 49.8 ± 4.3 | 45.7 ± 3.9 | 5.3 ± 0.5
N9 | 78.2 ± 2.4 | 74.9 ± 3.4 | 38.2 ± 4.5 | 50.6 ± 3.9 | 46.2 ± 3.6 | 5.4 ± 0.5
N10 | 78.1 ± 2.2 | 75.1 ± 3.5 | 39.9 ± 4.1 | 52.1 ± 3.6 | 47.4 ± 3.5 | 5.5 ± 0.5
N11 | 78.6 ± 2.1 | 75.2 ± 3.2 | 40.2 ± 4.2 | 52.4 ± 3.6 | 47.6 ± 3.4 | 5.5 ± 0.5
N12 | 79.8 ± 2.0 | 75.9 ± 3.2 | 40.9 ± 4.2 | 53.2 ± 3.6 | 48.5 ± 3.4 | 5.7 ± 0.5
N13 | 78.9 ± 1.9 | 74.8 ± 3.2 | 43.3 ± 4.3 | 54.9 ± 3.4 | 49.5 ± 3.4 | 5.8 ± 0.5
N14 | 80.2 ± 1.8 | 74.9 ± 3.2 | 39.7 ± 4.3 | 51.9 ± 3.5 | 47.1 ± 3.4 | 5.5 ± 0.5
N15 | 80.1 ± 1.9 | 76.3 ± 3.3 | 40.6 ± 4.2 | 53.0 ± 3.5 | 48.5 ± 3.5 | 5.7 ± 0.5
N16 | 81.4 ± 1.7 | 76.2 ± 3.2 | 40.1 ± 4.6 | 52.5 ± 3.8 | 48.0 ± 3.6 | 5.6 ± 0.5
N17 | 81.4 ± 1.7 | 76.1 ± 3.3 | 40.7 ± 4.5 | 53.0 ± 3.8 | 48.4 ± 3.6 | 5.7 ± 0.5
N18 | 81.1 ± 1.8 | 75.1 ± 3.2 | 41.1 ± 4.3 | 53.1 ± 3.6 | 48.2 ± 3.5 | 5.6 ± 0.5
N90 | 82.9 ± 1.0 | 73.8 ± 2.8 | 35.5 ± 4.5 | 47.9 ± 3.6 | 43.8 ± 3.4 | 5.1 ± 0.4

Table 13

Confidence intervals of performance measures (×100) and informational odds ratios for models produced by the balanced S. cerevisiae data set.

Feature subset | AUC | Precision | Recall | F1 | MCC | IOR
CMIM32 | 84.2 ± 1.5 | 77.2 ± 2.2 | 76.6 ± 2.9 | 76.9 ± 2.1 | 54.0 ± 3.9 | 3.3 ± 0.4
mRMR31 | 83.6 ± 1.6 | 76.5 ± 2.4 | 74.1 ± 3.0 | 75.2 ± 2.1 | 51.3 ± 4.1 | 3.0 ± 0.3
Hwang | 82.2 ± 1.7 | 77.8 ± 2.6 | 72.0 ± 3.7 | 74.8 ± 2.3 | 51.6 ± 3.9 | 3.0 ± 0.3
Acencio | 76.8 ± 2.2 | 69.6 ± 2.4 | 73.4 ± 4.0 | 71.4 ± 2.3 | 41.4 ± 4.3 | 2.5 ± 0.3
N4 | 81.1 ± 1.8 | 77.7 ± 2.5 | 71.6 ± 3.6 | 74.5 ± 2.3 | 51.2 ± 3.9 | 3.0 ± 0.3
N5 | 82.4 ± 1.8 | 77.8 ± 2.6 | 73.5 ± 3.5 | 75.6 ± 2.2 | 52.7 ± 4.1 | 3.1 ± 0.3
N6 | 82.7 ± 1.8 | 77.8 ± 2.6 | 73.9 ± 3.6 | 75.8 ± 2.3 | 53.0 ± 4.1 | 3.1 ± 0.3
N7 | 83.1 ± 1.8 | 77.9 ± 2.5 | 73.3 ± 3.5 | 75.5 ± 2.2 | 52.6 ± 4.0 | 3.1 ± 0.3
N8 | 82.6 ± 1.8 | 78.6 ± 2.5 | 72.1 ± 3.5 | 75.2 ± 2.3 | 52.7 ± 4.0 | 3.1 ± 0.3
N9 | 83.3 ± 1.8 | 79.1 ± 2.4 | 73.5 ± 3.4 | 76.2 ± 2.2 | 54.1 ± 3.8 | 3.2 ± 0.3
N10 | 83.4 ± 1.7 | 78.9 ± 2.4 | 73.6 ± 3.3 | 76.1 ± 2.1 | 54.0 ± 3.8 | 3.2 ± 0.3
N11 | 83.1 ± 1.7 | 78.4 ± 2.4 | 73.7 ± 3.5 | 76.0 ± 2.2 | 53.5 ± 3.8 | 3.2 ± 0.3
N12 | 82.9 ± 1.8 | 77.9 ± 2.5 | 73.2 ± 3.1 | 75.5 ± 2.2 | 52.6 ± 4.2 | 3.1 ± 0.3
N13 | 83.4 ± 1.7 | 78.8 ± 2.4 | 73.0 ± 3.5 | 75.8 ± 2.2 | 53.5 ± 3.9 | 3.1 ± 0.3
N14 | 83.6 ± 1.6 | 77.7 ± 2.3 | 74.3 ± 3.4 | 75.9 ± 2.1 | 53.0 ± 3.8 | 3.2 ± 0.3
N15 | 84.3 ± 1.7 | 78.4 ± 2.4 | 74.8 ± 3.3 | 76.6 ± 2.1 | 54.2 ± 3.9 | 3.3 ± 0.4
N16 | 84.2 ± 1.6 | 77.7 ± 2.2 | 75.6 ± 3.1 | 76.7 ± 2.0 | 54.0 ± 3.7 | 3.3 ± 0.4
N17 | 84.7 ± 1.6 | 77.8 ± 2.3 | 76.3 ± 3.0 | 77.0 ± 2.0 | 54.5 ± 3.8 | 3.3 ± 0.4
N18 | 84.0 ± 1.6 | 77.9 ± 2.4 | 74.0 ± 3.3 | 75.9 ± 2.0 | 53.1 ± 3.8 | 3.1 ± 0.3
N90 | 83.9 ± 1.4 | 76.0 ± 2.0 | 75.3 ± 2.7 | 75.7 ± 1.8 | 51.6 ± 3.5 | 3.1 ± 0.3

Table 14

Confidence intervals of performance measures (×100) and informational odds ratios for models produced by the imbalanced E. coli data set.

Feature subset | AUC | Precision | Recall | F1 | MCC | IOR
CMIM09 | 70.1 ± 0.9 | 72.0 ± 1.4 | 27.1 ± 0.7 | 39.4 ± 0.9 | 38.2 ± 0.9 | 5.2 ± 0.6
mRMR13 | 71.5 ± 2.4 | 71.3 ± 5.4 | 25.0 ± 5.8 | 37.0 ± 6.8 | 36.0 ± 5.8 | 4.7 ± 0.6
Gustafson | 71.1 ± 2.3 | 66.5 ± 4.6 | 25.5 ± 5.0 | 36.8 ± 5.2 | 34.7 ± 4.8 | 4.9 ± 0.6
N4 | 69.1 ± 1.8 | 72.5 ± 1.0 | 28.0 ± 0.6 | 40.4 ± 0.7 | 39.1 ± 0.7 | 5.5 ± 0.6
N5 | 69.0 ± 2.0 | 73.7 ± 1.4 | 29.5 ± 0.8 | 42.1 ± 0.9 | 40.7 ± 1.0 | 5.7 ± 0.6
N6 | 70.1 ± 1.7 | 74.2 ± 1.4 | 28.7 ± 0.9 | 41.4 ± 1.1 | 40.3 ± 1.0 | 5.7 ± 0.6
N7 | 71.4 ± 1.4 | 73.5 ± 1.3 | 27.5 ± 0.7 | 40.0 ± 0.9 | 39.2 ± 0.9 | 5.5 ± 0.6
N8 | 70.5 ± 1.3 | 74.2 ± 1.1 | 28.8 ± 0.8 | 41.5 ± 0.9 | 40.5 ± 0.9 | 5.7 ± 0.6
N9 | 70.7 ± 1.5 | 72.6 ± 1.4 | 29.3 ± 1.0 | 41.7 ± 1.2 | 40.1 ± 1.2 | 5.6 ± 0.6
N10 | 71.1 ± 1.5 | 72.4 ± 1.6 | 29.4 ± 1.0 | 41.8 ± 1.1 | 40.1 ± 1.2 | 5.6 ± 0.6
N11 | 71.4 ± 1.4 | 73.2 ± 1.3 | 27.8 ± 0.8 | 40.3 ± 1.0 | 39.3 ± 1.0 | 5.5 ± 0.6
N12 | 71.2 ± 1.4 | 72.5 ± 1.9 | 29.2 ± 1.1 | 41.6 ± 1.2 | 40.0 ± 1.3 | 5.6 ± 0.6
N13 | 71.4 ± 1.3 | 73.3 ± 1.8 | 28.7 ± 1.2 | 41.3 ± 1.4 | 40.0 ± 1.4 | 5.6 ± 0.6
N80 | 71.6 ± 0.9 | 67.7 ± 2.1 | 23.7 ± 1.3 | 35.2 ± 1.6 | 33.9 ± 1.6 | 4.9 ± 0.6

Table 15

Confidence intervals of performance measures (×100) and informational odds ratios for models produced by balanced E. coli data set.

Feature subset | AUC | Precision | Recall | F1 | MCC | IOR
CMIM09 | 76.7 ± 1.7 | 72.0 ± 2.4 | 70.0 ± 3.7 | 71.0 ± 2.2 | 42.1 ± 4.0 | 2.4 ± 0.3
mRMR13 | 76.2 ± 2.0 | 72.8 ± 3.1 | 65.4 ± 6.4 | 68.9 ± 3.3 | 39.6 ± 3.9 | 2.2 ± 0.2
Gustafson | 77.7 ± 2.6 | 72.2 ± 3.3 | 71.5 ± 4.0 | 71.9 ± 2.8 | 44.0 ± 5.5 | 2.6 ± 0.3
N4 | 78.0 ± 1.6 | 73.3 ± 2.5 | 70.1 ± 2.7 | 71.7 ± 1.9 | 44.6 ± 3.8 | 2.6 ± 0.3
N5 | 77.9 ± 1.7 | 73.0 ± 2.5 | 70.6 ± 2.8 | 71.8 ± 1.8 | 44.5 ± 3.8 | 2.6 ± 0.3
N6 | 76.2 ± 1.7 | 73.5 ± 2.6 | 66.3 ± 4.3 | 69.6 ± 2.6 | 42.5 ± 4.1 | 2.4 ± 0.3
N7 | 78.3 ± 1.7 | 73.7 ± 2.5 | 69.6 ± 2.7 | 71.6 ± 1.8 | 44.8 ± 3.6 | 2.6 ± 0.3
N8 | 78.1 ± 1.7 | 72.3 ± 2.3 | 71.1 ± 3.3 | 71.7 ± 2.1 | 43.9 ± 3.8 | 2.5 ± 0.3
N9 | 78.2 ± 1.6 | 71.5 ± 2.2 | 70.3 ± 4.1 | 70.9 ± 2.3 | 42.3 ± 3.8 | 2.4 ± 0.3
N10 | 78.1 ± 1.6 | 72.5 ± 2.4 | 70.2 ± 3.3 | 71.3 ± 2.1 | 43.6 ± 3.9 | 2.5 ± 0.3
N11 | 77.7 ± 1.7 | 71.9 ± 2.2 | 70.0 ± 3.3 | 70.9 ± 2.0 | 42.6 ± 3.6 | 2.5 ± 0.3
N12 | 77.6 ± 1.9 | 71.5 ± 2.3 | 69.5 ± 4.5 | 70.5 ± 2.5 | 41.8 ± 4.0 | 2.4 ± 0.3
N13 | 77.6 ± 1.7 | 73.1 ± 2.4 | 69.5 ± 3.0 | 71.2 ± 2.1 | 43.9 ± 3.9 | 2.5 ± 0.3
N80 | 76.9 ± 1.8 | 71.1 ± 2.4 | 71.5 ± 2.4 | 71.3 ± 1.8 | 42.4 ± 3.8 | 2.5 ± 0.3

Table 16

Performance comparison of our method vs. mRMR for the imbalanced S. cerevisiae data set with the same sizes of feature subsets, where the > symbol represents that the values are significantly higher.

Feature subset | AUC (ours, mRMR) | Precision (ours, mRMR) | Recall (ours, mRMR) | F-measure (ours, mRMR) | MCC (ours, mRMR)
N4 | 0.744, 0.762 | 0.782, 0.756 | 0.327, 0.331 | 0.461, 0.461 | 0.439, 0.430
N5 | 0.727, 0.718 | 0.741, 0.740 | 0.387, 0.359 | 0.509, 0.484 | 0.461, 0.442
N6 | 0.730, 0.753 | 0.752, 0.753 | 0.395 > 0.333 | 0.518 > 0.462 | 0.472 > 0.430
N7 | 0.761, 0.763 | 0.767, 0.761 | 0.386 > 0.330 | 0.513 > 0.460 | 0.473 > 0.431
N8 | 0.772 > 0.771 | 0.755, 0.757 | 0.371 > 0.326 | 0.498 > 0.456 | 0.457 > 0.427
N9 | 0.782, 0.776 | 0.749, 0.749 | 0.382 > 0.341 | 0.506 > 0.469 | 0.462 > 0.434
N10 | 0.781 > 0.778 | 0.751, 0.752 | 0.399 > 0.340 | 0.521 > 0.469 | 0.474 > 0.434
N11 | 0.786 > 0.774 | 0.752, 0.750 | 0.402 > 0.341 | 0.524 > 0.469 | 0.476 > 0.434
N12 | 0.798 > 0.781 | 0.759, 0.757 | 0.409 > 0.334 | 0.532 > 0.463 | 0.485 > 0.432
N13 | 0.789 > 0.774 | 0.748, 0.746 | 0.433 > 0.342 | 0.549 > 0.469 | 0.495 > 0.432
N14 | 0.802 > 0.775 | 0.749, 0.750 | 0.397 > 0.340 | 0.519 > 0.468 | 0.471 > 0.433
N15 | 0.801 > 0.798 | 0.763, 0.764 | 0.406 > 0.318 | 0.530 > 0.449 | 0.485 > 0.424
N16 | 0.814 > 0.799 | 0.762, 0.762 | 0.401 > 0.318 | 0.525 > 0.449 | 0.480 > 0.423
N17 | 0.814 > 0.799 | 0.761, 0.759 | 0.407 > 0.326 | 0.530 > 0.456 | 0.484 > 0.427
N18 | 0.811 > 0.797 | 0.751, 0.749 | 0.411 > 0.342 | 0.531 > 0.469 | 0.482 > 0.434

Table 17

Performance comparison of our new method vs. CMIM for the imbalanced S. cerevisiae data set when identical numbers of features are selected.

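| Subset | AUC (ours / CMIM) | Precision (ours / CMIM) | Recall (ours / CMIM) | F1 (ours / CMIM) | MCC (ours / CMIM) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.744 / 0.761 | 0.782 / 0.762 | 0.327 / 0.344 | 0.461 / 0.474 | 0.439 / 0.442 |
| N5 | 0.727 / 0.735 | 0.741 / 0.738 | 0.387 / 0.371 | 0.509 / 0.494 | 0.461 / 0.449 |
| N6 | 0.730 / 0.757 | 0.752 / 0.749 | 0.395 / 0.359 | 0.518 / 0.485 | 0.472 / 0.446 |
| N7 | 0.761 / 0.779 | 0.767 / 0.763 | 0.386 > 0.339 | 0.513 > 0.470 | 0.473 > 0.439 |
| N8 | 0.772 / 0.779 | 0.755 / 0.754 | 0.371 > 0.350 | 0.498 > 0.478 | 0.457 > 0.442 |
| N9 | 0.782 / 0.776 | 0.749 / 0.750 | 0.382 > 0.357 | 0.506 > 0.483 | 0.462 > 0.445 |
| N10 | 0.781 / 0.782 | 0.751 / 0.751 | 0.399 > 0.353 | 0.521 > 0.480 | 0.474 > 0.443 |
| N11 | 0.786 / 0.786 | 0.752 / 0.752 | 0.402 > 0.363 | 0.524 > 0.490 | 0.476 > 0.450 |
| N12 | 0.798 / 0.799 | 0.759 / 0.758 | 0.409 > 0.354 | 0.532 > 0.483 | 0.485 > 0.447 |
| N13 | 0.789 / 0.797 | 0.748 / 0.750 | 0.433 > 0.360 | 0.549 > 0.487 | 0.495 > 0.447 |
| N14 | 0.802 > 0.801 | 0.749 / 0.749 | 0.397 > 0.348 | 0.519 > 0.475 | 0.471 > 0.438 |
| N15 | 0.801 > 0.797 | 0.763 / 0.760 | 0.406 > 0.330 | 0.530 > 0.460 | 0.485 > 0.430 |
| N16 | 0.814 > 0.796 | 0.762 / 0.759 | 0.401 > 0.338 | 0.525 > 0.468 | 0.480 > 0.436 |
| N17 | 0.814 > 0.795 | 0.761 / 0.756 | 0.407 > 0.339 | 0.530 > 0.469 | 0.484 > 0.435 |
| N18 | 0.811 / 0.799 | 0.751 / 0.756 | 0.411 > 0.338 | 0.531 > 0.467 | 0.482 > 0.435 |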
Table 18

Performance comparison of our method vs. mRMR for the balanced S. cerevisiae data set with the same sizes of feature subsets, where the > symbol indicates that the values are significantly higher.

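| Subset | AUC (ours / mRMR) | Precision (ours / mRMR) | Recall (ours / mRMR) | F-measure (ours / mRMR) | MCC (ours / mRMR) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.811 / 0.815 | 0.777 / 0.770 | 0.716 / 0.725 | 0.745 / 0.747 | 0.512 / 0.510 |
| N5 | 0.824 / 0.818 | 0.778 / 0.771 | 0.735 / 0.722 | 0.756 / 0.745 | 0.527 / 0.508 |
| N6 | 0.827 / 0.814 | 0.778 / 0.775 | 0.739 / 0.709 | 0.758 / 0.740 | 0.530 / 0.504 |
| N7 | 0.831 / 0.824 | 0.779 / 0.779 | 0.733 / 0.718 | 0.755 / 0.747 | 0.526 / 0.516 |
| N8 | 0.826 / 0.827 | 0.786 / 0.781 | 0.721 / 0.721 | 0.752 / 0.750 | 0.527 / 0.521 |
| N9 | 0.833 / 0.834 | 0.791 / 0.783 | 0.735 / 0.734 | 0.762 / 0.758 | 0.541 / 0.531 |
| N10 | 0.834 / 0.835 | 0.789 / 0.783 | 0.736 / 0.733 | 0.761 / 0.757 | 0.540 / 0.531 |
| N11 | 0.831 / 0.834 | 0.784 / 0.780 | 0.737 / 0.730 | 0.760 / 0.754 | 0.535 / 0.525 |
| N12 | 0.829 / 0.834 | 0.779 / 0.778 | 0.732 / 0.734 | 0.755 / 0.755 | 0.526 > 0.525 |
| N13 | 0.834 / 0.834 | 0.788 / 0.779 | 0.730 / 0.732 | 0.758 / 0.754 | 0.535 / 0.525 |
| N14 | 0.836 / 0.832 | 0.777 / 0.777 | 0.743 / 0.731 | 0.759 / 0.753 | 0.530 / 0.522 |
| N15 | 0.843 / 0.835 | 0.784 / 0.778 | 0.748 / 0.734 | 0.766 / 0.756 | 0.542 / 0.526 |
| N16 | 0.842 > 0.836 | 0.777 / 0.777 | 0.756 > 0.735 | 0.767 > 0.755 | 0.540 > 0.525 |
| N17 | 0.847 > 0.834 | 0.778 / 0.777 | 0.763 / 0.733 | 0.770 > 0.754 | 0.545 > 0.523 |
| N18 | 0.840 / 0.835 | 0.779 / 0.778 | 0.740 / 0.735 | 0.759 / 0.756 | 0.531 / 0.526 |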
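
For a balanced data set, the MCC is determined by precision and recall alone, which gives a way to sanity-check the MCC values in Table 18. The sketch below assumes an exact 1:1 ratio of essential to non-essential proteins after down-sampling (an assumption, not a figure stated in the table) and uses a hypothetical helper, `mcc_from_precision_recall`, that rebuilds a normalized confusion matrix and applies the standard MCC formula. With the N17 values for our method (precision 0.778, recall 0.763) it gives roughly the reported MCC of 0.545.

```python
import math


def mcc_from_precision_recall(precision: float, recall: float,
                              neg_per_pos: float = 1.0) -> float:
    """MCC implied by precision and recall for a given class ratio.

    Counts are expressed per positive example; neg_per_pos is the assumed
    number of negatives per positive (1.0 for a balanced set).
    Requires precision > 0.
    """
    tp = recall
    fn = 1.0 - recall
    fp = tp * (1.0 - precision) / precision  # from precision = tp / (tp + fp)
    tn = neg_per_pos - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# N17 row of Table 18, our method: precision 0.778, recall 0.763.
n17_mcc = mcc_from_precision_recall(0.778, 0.763)
print(f"implied MCC = {n17_mcc:.3f} (reported: 0.545)")
```

The same relation does not carry over directly to the imbalanced tables, because there the class ratio would have to be supplied through `neg_per_pos`.
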
Table 19

Performance comparison of our method vs. CMIM for the balanced S. cerevisiae data set when identical numbers of features are selected.

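| Subset | AUC (ours / CMIM) | Precision (ours / CMIM) | Recall (ours / CMIM) | F1 (ours / CMIM) | MCC (ours / CMIM) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.811 / 0.813 | 0.777 / 0.777 | 0.716 / 0.724 | 0.745 / 0.749 | 0.512 / 0.517 |
| N5 | 0.824 / 0.817 | 0.778 / 0.775 | 0.735 / 0.740 | 0.756 / 0.757 | 0.527 / 0.526 |
| N6 | 0.827 / 0.821 | 0.778 / 0.777 | 0.739 / 0.742 | 0.758 / 0.759 | 0.530 / 0.529 |
| N7 | 0.831 / 0.830 | 0.779 / 0.772 | 0.733 / 0.744 | 0.755 / 0.758 | 0.526 / 0.524 |
| N8 | 0.826 / 0.833 | 0.786 / 0.776 | 0.721 / 0.738 | 0.752 / 0.756 | 0.527 / 0.525 |
| N9 | 0.833 / 0.834 | 0.791 / 0.775 | 0.735 / 0.740 | 0.762 / 0.757 | 0.541 / 0.526 |
| N10 | 0.834 / 0.835 | 0.789 / 0.776 | 0.736 / 0.739 | 0.761 / 0.757 | 0.540 / 0.527 |
| N11 | 0.831 / 0.836 | 0.784 / 0.778 | 0.737 / 0.739 | 0.760 / 0.758 | 0.535 / 0.528 |
| N12 | 0.829 / 0.838 | 0.779 / 0.779 | 0.732 / 0.742 | 0.755 / 0.760 | 0.526 / 0.532 |
| N13 | 0.834 / 0.837 | 0.788 / 0.778 | 0.730 / 0.741 | 0.758 / 0.759 | 0.535 / 0.530 |
| N14 | 0.836 / 0.836 | 0.777 / 0.777 | 0.743 / 0.743 | 0.759 / 0.759 | 0.530 / 0.530 |
| N15 | 0.843 / 0.836 | 0.784 / 0.777 | 0.748 / 0.739 | 0.766 / 0.758 | 0.542 / 0.528 |
| N16 | 0.842 > 0.837 | 0.777 / 0.776 | 0.756 / 0.741 | 0.767 > 0.758 | 0.540 > 0.528 |
| N17 | 0.847 > 0.838 | 0.778 / 0.777 | 0.763 > 0.744 | 0.770 > 0.760 | 0.545 > 0.531 |
| N18 | 0.840 / 0.837 | 0.779 / 0.778 | 0.740 / 0.746 | 0.759 / 0.762 | 0.531 / 0.533 |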
Table 20

Performance comparison of our method vs. mRMR for the imbalanced E. coli data set when identical numbers of features are selected.

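| Subset | AUC (ours / mRMR) | Precision (ours / mRMR) | Recall (ours / mRMR) | F1 (ours / mRMR) | MCC (ours / mRMR) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.691 / 0.651 | 0.725 / 0.678 | 0.280 / 0.269 | 0.404 / 0.385 | 0.391 / 0.363 |
| N5 | 0.690 / 0.675 | 0.737 / 0.687 | 0.295 / 0.254 | 0.421 / 0.371 | 0.407 / 0.356 |
| N6 | 0.701 / 0.681 | 0.742 / 0.708 | 0.287 / 0.220 | 0.414 / 0.336 | 0.403 / 0.338 |
| N7 | 0.714 / 0.686 | 0.735 / 0.712 | 0.275 / 0.212 | 0.400 / 0.326 | 0.392 / 0.333 |
| N8 | 0.705 / 0.692 | 0.742 / 0.713 | 0.288 / 0.209 | 0.415 / 0.323 | 0.405 / 0.330 |
| N9 | 0.707 / 0.692 | 0.726 / 0.713 | 0.293 > 0.199 | 0.417 > 0.312 | 0.401 / 0.322 |
| N10 | 0.711 / 0.697 | 0.724 / 0.703 | 0.294 > 0.193 | 0.418 > 0.302 | 0.401 / 0.313 |
| N11 | 0.714 < 0.702 | 0.732 / 0.697 | 0.278 > 0.187 | 0.403 / 0.295 | 0.393 / 0.306 |
| N12 | 0.712 < 0.704 | 0.725 / 0.683 | 0.292 / 0.192 | 0.416 / 0.300 | 0.400 / 0.305 |
| N13 | 0.714 < 0.715 | 0.733 / 0.713 | 0.287 / 0.250 | 0.413 / 0.370 | 0.400 / 0.360 |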
Table 21

Performance comparison of our method vs. CMIM for the imbalanced E. coli data set when identical numbers of features are selected.

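| Subset | AUC (ours / CMIM) | Precision (ours / CMIM) | Recall (ours / CMIM) | F1 (ours / CMIM) | MCC (ours / CMIM) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.691 > 0.663 | 0.725 / 0.717 | 0.280 / 0.271 | 0.404 / 0.393 | 0.391 / 0.381 |
| N5 | 0.690 / 0.686 | 0.737 / 0.710 | 0.295 > 0.264 | 0.421 > 0.385 | 0.407 > 0.373 |
| N6 | 0.701 / 0.697 | 0.742 / 0.715 | 0.287 > 0.265 | 0.414 > 0.387 | 0.403 > 0.376 |
| N7 | 0.714 / 0.693 | 0.735 / 0.711 | 0.275 > 0.261 | 0.400 > 0.382 | 0.392 > 0.371 |
| N8 | 0.705 / 0.690 | 0.742 / 0.709 | 0.288 > 0.254 | 0.415 > 0.373 | 0.405 > 0.364 |
| N9 | 0.707 / 0.701 | 0.726 / 0.720 | 0.293 > 0.271 | 0.417 > 0.394 | 0.401 > 0.382 |
| N10 | 0.711 / 0.702 | 0.724 / 0.692 | 0.294 > 0.248 | 0.418 > 0.364 | 0.401 > 0.353 |
| N11 | 0.714 / 0.698 | 0.732 / 0.690 | 0.278 > 0.247 | 0.403 > 0.363 | 0.393 > 0.351 |
| N12 | 0.712 / 0.690 | 0.725 / 0.683 | 0.292 > 0.239 | 0.416 > 0.353 | 0.400 > 0.342 |
| N13 | 0.714 / 0.688 | 0.733 / 0.678 | 0.287 > 0.236 | 0.413 > 0.349 | 0.400 > 0.337 |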
Table 22

Performance comparison of our method vs. mRMR for the balanced E. coli data set when identical numbers of features are selected.

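| Subset | AUC (ours / mRMR) | Precision (ours / mRMR) | Recall (ours / mRMR) | F1 (ours / mRMR) | MCC (ours / mRMR) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.780 > 0.773 | 0.733 / 0.726 | 0.701 > 0.651 | 0.717 > 0.686 | 0.446 > 0.407 |
| N5 | 0.779 / 0.772 | 0.730 / 0.720 | 0.706 > 0.654 | 0.718 > 0.684 | 0.445 > 0.401 |
| N6 | 0.762 / 0.771 | 0.735 / 0.717 | 0.663 / 0.649 | 0.696 / 0.680 | 0.425 / 0.394 |
| N7 | 0.783 > 0.768 | 0.737 / 0.716 | 0.696 / 0.649 | 0.716 > 0.680 | 0.448 > 0.394 |
| N8 | 0.781 > 0.764 | 0.723 / 0.715 | 0.711 > 0.641 | 0.717 > 0.675 | 0.439 > 0.387 |
| N9 | 0.782 > 0.764 | 0.715 / 0.713 | 0.703 / 0.643 | 0.709 > 0.675 | 0.423 > 0.386 |
| N10 | 0.781 > 0.765 | 0.725 / 0.716 | 0.702 > 0.636 | 0.713 > 0.673 | 0.436 > 0.386 |
| N11 | 0.777 > 0.766 | 0.719 / 0.720 | 0.700 / 0.643 | 0.709 > 0.678 | 0.426 / 0.394 |
| N12 | 0.776 / 0.765 | 0.715 / 0.714 | 0.695 / 0.643 | 0.705 / 0.676 | 0.418 / 0.388 |
| N13 | 0.776 > 0.762 | 0.731 / 0.728 | 0.695 > 0.654 | 0.712 > 0.689 | 0.439 > 0.396 |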
Table 23

Performance comparison of our method vs. CMIM for the balanced E. coli data set when identical numbers of features are selected.

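| Subset | AUC (ours / CMIM) | Precision (ours / CMIM) | Recall (ours / CMIM) | F1 (ours / CMIM) | MCC (ours / CMIM) |
| --- | --- | --- | --- | --- | --- |
| N4 | 0.780 / 0.769 | 0.733 / 0.719 | 0.701 / 0.696 | 0.717 / 0.707 | 0.446 / 0.424 |
| N5 | 0.779 / 0.771 | 0.730 / 0.715 | 0.706 / 0.696 | 0.718 / 0.705 | 0.445 / 0.419 |
| N6 | 0.762 < 0.771 | 0.735 / 0.716 | 0.663 / 0.684 | 0.696 / 0.699 | 0.425 / 0.413 |
| N7 | 0.783 / 0.769 | 0.737 / 0.711 | 0.696 / 0.696 | 0.716 / 0.703 | 0.448 / 0.413 |
| N8 | 0.781 / 0.767 | 0.723 / 0.709 | 0.711 / 0.697 | 0.717 / 0.703 | 0.439 / 0.412 |
| N9 | 0.782 / 0.767 | 0.715 / 0.720 | 0.703 / 0.700 | 0.709 / 0.710 | 0.423 / 0.421 |
| N10 | 0.781 / 0.767 | 0.725 / 0.705 | 0.702 / 0.702 | 0.713 / 0.704 | 0.436 / 0.409 |
| N11 | 0.777 / 0.765 | 0.719 / 0.706 | 0.700 / 0.700 | 0.709 / 0.703 | 0.426 / 0.408 |
| N12 | 0.776 / 0.764 | 0.715 / 0.703 | 0.695 / 0.698 | 0.705 / 0.700 | 0.418 / 0.404 |
| N13 | 0.776 / 0.765 | 0.731 / 0.704 | 0.695 / 0.700 | 0.712 / 0.702 | 0.439 / 0.406 |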
