Narjeskhatoon Habibi, Siti Z Mohd Hashim, Alireza Norouzi, Mohammed Razip Samian.
Abstract
BACKGROUND: Over the last 20 years, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failures for opaque reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is ultimately determined by its sequence. To date, numerous machine learning methods have been proposed to predict the solubility of a protein merely from its amino acid sequence. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.
Year: 2014 PMID: 24885721 PMCID: PMC4098780 DOI: 10.1186/1471-2105-15-134
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
A summary of key components of studies to predict protein solubility (in chronological order)
| # | Ref. | Dataset | Feature selection | Classifier | Tool |
|---|---|---|---|---|---|
| 1 | [ | Bacterial protein sequences with ‘soluble’ and ‘insoluble’ in NCBI are selected randomly. | Wrapper: SVM | Support vector machine | - |
| Size: 5692 | |||||
| Soluble: 2448 | |||||
| Insoluble: 3244 | |||||
| 2 | [ | HGPD | Filter: Student’s t-test | Two techniques: | ESPRESSO: |
| Support vector machine | |||||
| Size: 5100 | |||||
| Soluble: 1774 | |||||
| Insoluble: 3326 | |||||
| Wheat germ | Sequence pattern-based method | ||||
| Size: 2939 | |||||
| Soluble: 1941 | |||||
| Insoluble: 998 | |||||
| 3 | [ | eSol | Two methods: | Random forest | ProS: |
| Size: 1918 | 1. Filter: Student’s t-test | ||||
| Soluble: 886 | 2. Wrapper: Random forest | ||||
| Insoluble: 1032 | |||||
| 4 | [ | Four datasets: | - | Two methods: | SCM: |
| Sd957 | Support vector machine | ||||
| Dataset Chan et al. [ | Scoring card method (SCM) | ||||
| Solpro | |||||
| PROSO II | |||||
| 5 | [ | eSol | - | Four techniques: | - |
| Size: 1600 | 1. Support vector machine | ||||
| 2. Random forest | |||||
| 3. Conditional inference trees | |||||
| 4. Rule ensemble | |||||
| 6 | [ | PROSO II | Wrapper | A two-layer model: | PROSOII: |
| 1. Layer 1: Parzen window + logistic regression | |||||
| 2. Layer 2: Logistic regression | |||||
| 7 | [ | eSol | - | Decision tree | - |
| Size: 1625 | |||||
| Soluble: 843 | |||||
| Insoluble: 782 | |||||
| 8 | [ | eSol | Wrapper: SVM | Support vector machine | - |
| Size: 2159 | |||||
| Soluble: 1081 | |||||
| Insoluble: 1078 | |||||
| 9 | [ | HGPD | Filter: Student’s t-test | Random forest | - |
| Size: 7823 | |||||
| Soluble: 2796 | |||||
| Insoluble: 5027 | |||||
| Wheat germ | |||||
| Size: 3955 | |||||
| Soluble: 2739 | |||||
| Insoluble: 1216 | |||||
| 10 | [ | SOLP | Seven methods: | Support vector machine | - |
| 1. Filter: Information gain | |||||
| 2. Filter: Gain ratio | |||||
| 3. Filter: Chi squared | |||||
| 4. Filter: Symmetrical uncertainty | |||||
| 5. Wrapper: ReliefF | |||||
| 6. Wrapper: SVM recursive feature elimination (SvmRfe) | |||||
| 7. Embedded: One attribute rule | |||||
| 11 | [ | 121 genes from different species were expressed in 6 different vectors. | Feature selection package in LIBSVM: Filter (F-score) + Wrapper (SVM) | Support vector machine | - |
| Size: 726 | |||||
| Soluble: 231 | |||||
| Insoluble: 236 | |||||
| Non-expressed: 259 | |||||
| 12 | [ | A database collected through literature search. | N/A | Logistic regression | |
| Size: 212 | |||||
| Soluble: 52 | |||||
| Insoluble: 160 | |||||
| 13 | [ | Solpro | Wrapper | A two-layer model: | SOLpro: |
| 1. Layer 1: 20 Support vector machines | |||||
| 2. Layer 2: One support vector machine | |||||
| 14 | [ | eSol | Using histogram | Support vector machine | - |
| 15 | [ | PROSO | Two methods: | A two-layer model: | PROSO: |
| 1. Wrapper | Layer 1: Support vector machine | ||||
| 2. Filter: Symmetrical uncertainty | Layer 2: Naive Bayes | ||||
| 16 | [ | Idicula‒Thomas 2006 | N/A | Support vector machine | - |
| 17 | [ | Idicula‒Thomas 2006 | Filter: Unbalanced correlation score | Support vector machine | - |
| 18 | [ | Idicula‒Thomas 2005 | Filter: Mann–Whitney test | Discriminant analysis (A heuristic approach of computing solubility index (SI)) | - |
| 19 | [ | Genes of | Filter: Linear correlation coefficient (LCC) | - | |
| Size: 4854 | |||||
| Soluble: 1536 | |||||
| Insoluble: 3318 | |||||
| 20 | [ | TargetDB | Wrapper: Random forest | Decision tree | - |
| Size: 27,000 | |||||
| 21 | [ | SPINE | Wrapper | Decision tree | - |
| Size: 562 | |||||
| 22 | [ | SPINE | Embedded: Decision tree | Decision tree | - |
| Size: 356 | |||||
| Soluble: 213 | |||||
| Insoluble: 143 | |||||
| 23 | [ | Some genes | N/A | Regression | - |
| Size: 100 | |||||
| 24 | [ | Some genes | N/A | Regression | - |
| Size: 81 |
Reported prediction performances of the models (in chronological order)
| 1 | [ | 0.88 | - | - | - | 0.76 | - | - | - | - |
| 2* | [ | 0.68 | 0.78 | 0.67 | - | 0.42 | 0.56 | 0.85 | - | - |
| 0.75 | 0.75 | 0.82 | - | 0.42 | 0.79 | 0.86 | - | - | ||
| 3 | [ | 0.84 | 0.91 | - | - | 0.67 | - | - | 0.82 | 0.85 |
| 4 | [ | 0.84 | - | - | - | - | - | - | - | - |
| 5 | [ | 0.90 | - | - | - | - | - | - | 0.80 | 0.80 |
| 6 | [ | 0.75 | - | - | 1.69 | 0.39 | 0.65 | 0.76 | 0.73 | |
| 7 | [ | 0.75 | 0.81 | - | - | - | - | - | - | - |
| 8 | [ | - | - | - | - | - | - | - | - | - |
| 9* | [ | 0.71 | - | - | - | - | 0.47 | 0.67 | - | - |
| 0.71 | - | - | - | - | 0.85 | 0.74 | - | - | ||
| 10 | [ | - | - | - | - | - | - | - | - | - |
| 11 | [ | 0.83 | 0.89 | 0.75 | - | - | 0.73 | 0.78 | - | - |
| 12 | [ | 0.94 | - | - | - | - | - | - | - | - |
| 13 | [ | 0.74 | 0.74 | - | 1.49 | 0.49 | 0.74 | 0.74 | - | - |
| 14 | [ | 0.80 | - | - | - | - | - | - | - | - |
| 15 | [ | 0.72 | 0.78 | - | 1.43 | 0.43 | - | 0.72 | - | - |
| 16 | [ | 0.79 | 0.76 | - | - | - | - | - | 0.68 | 0.85 |
| 17 | [ | 0.74 | - | - | - | - | - | - | 0.57 | 0.81 |
| 18 | [ | 0.72 | - | - | - | - | - | - | - | - |
| 19 | [ | - | - | - | - | - | - | |||
| 20 | [ | 0.76 | - | - | - | - | - | - | - | - |
| 21 | [ | 0.63 | - | - | - | - | - | - | - | - |
| 22 | [ | 0.65 | - | - | - | - | - | - | - | - |
| 23 | [ | - | - | - | - | - | - | - | - | - |
| 24 | [ | 0.88 | - | - | - | - | - | - | - | - |
a. *Results for E. coli and wheat germ are shown respectively.
Features used to predict protein solubility
| 1 | [ | 1. 2-level triangle CGR |
| 2. Entropy of “2-level triangle CGR” | ||
| 3. Dipeptide composition based on a different mode of pseudo amino acid composition (PseAAC) | ||
| 4. Entropy of “dipeptide composition” | ||
| 2 | [ | Same as row 9 (Reference [ |
| 3 | [ | 1. Counts of aromatic amino acids |
| 2. Counts of buried amino acids | ||
| 3. Counts of hydrogen bonds | ||
| 4. Counts of leucine amino acid | ||
| 5. Counts of arginine amino acid | ||
| 6. Negative charge | ||
| 7. Surface composition of amino acids in intracellular proteins of Mesophiles (percent) | ||
| 8. Beta-strand indices for beta-proteins | ||
| 9. Flexibility parameter for two rigid neighbours | ||
| 10. Net charge | ||
| 11. Counts of nitrogen atoms | ||
| 12. Long range non-bonded energy per atom | ||
| 13. Isoelectric point (pI) | ||
| 14. Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water | ||
| 15. Ratio of negative charge amino acids | ||
| 16. Ratio of net charge of protein | ||
| 17. Dependence of partition coefficient on ionic strength | ||
| 4 | [ | Dipeptide composition (400 features) |
| 5 | [ | 1. Reduced features (39 features produced by pepstats): |
| a. Molecular weight, number of residues, average residue weight, charge and isoelectric point | ||
| b. For each type of amino acid: number, molar percent and DayhoffStat | ||
| c. For each physicochemical class of amino acid: number, molar percent, molar extinction coefficient (A280) and extinction coefficient at 1 mg/ml (A280) | ||
| 2. Dimers (2400 features): | ||
| a. Dimers amino acid frequencies which are computed considering gaps of 1–5 amino acid | ||
| 3. Complete set | ||
| a. Reduced features + Dimers | ||
| 6 | [ | 1. Amino acid frequencies (18 features): R, N, D, C, Q, E, G, H, I, K, M, F, P, S, T, W, Y, V |
| 2. Dipeptide frequencies (13 features): AK, CV, EG, GN, GH, HE, IH, IW, MR, MQ, PR, TS, WD | ||
| 7 | [ | 1. Monomers, dimers and trimers using 7 different alphabets (18 features) |
| 2. Sequence-computed features: | ||
| a. Molecular weight | ||
| b. Sequence length | ||
| c. Isoelectric point | ||
| d. GRAVY index | ||
| 3. Features used in Niwa et al. work [ | ||
| 4. Combination of all the above features 1–3. | ||
| 8 | [ | 1. Coil |
| 2. Disorder | ||
| 3. Hydrophobicity | ||
| 4. Hydrophilicity | ||
| 5. β-turn | ||
| 6. α-helix | ||
| 9 | [ | 1. Nucleotide sequence information: |
| a. 1-mer | ||
| b. Frequencies of 64 codons (3-mer) | ||
| c. GC-contents | ||
| 2. Amino acid sequence information: | ||
| a. Polypeptide length | ||
| b. Frequencies of 20 single amino acids (1-mer) | ||
| c. Frequencies of 8 chemical property groups | ||
| d. Frequencies of 5 physical property groups | ||
| e. Repeat of amino acids | ||
| f. Repeat of 8 chemical property groups | ||
| g. Repeat of 5 physical property groups | ||
| 3. Amino acid structural information: | ||
| a. Frequencies of single amino acids in surface area | ||
| b. Frequencies of 8 chemical property groups in surface area | ||
| c. Frequencies of 5 physical property groups in surface area | ||
| d. Number of transmembrane regions | ||
| e. Disordered regions: | ||
| i. Number of occurrence | ||
| ii. Length | ||
| iii. Proportion | ||
| f. Secondary structures: | ||
| i. alpha-helix | ||
| ii. Beta-sheet | ||
| iii. Others | ||
| 10 | [ | 1497 features computed by Protein Feature Server (PROFEAT) [ |
| 1. Group 1: | ||
| a. Amino acid composition | ||
| b. Dipeptide composition | ||
| 2. Group 2: Autocorrelation 1 | ||
| a. Normalized Moreau-Broto autocorrelation | ||
| 3. Group 3: Autocorrelation 2 | ||
| a. Moran autocorrelation | ||
| 4. Group 4: Autocorrelation 3 | ||
| a. Geary autocorrelation | ||
| 5. Group 5: | ||
| a. Composition | ||
| b. Transition | ||
| c. Distribution | ||
| 6. Group 6: Sequence order 1 | ||
| a. Sequence-order-coupling number | ||
| b. Quasi-sequence-order descriptors | ||
| 7. Group 7: Sequence order 2 | ||
| a. Pseudo amino acid descriptors | ||
| 11 | [ | 1. Nucleotide information: |
| a. 1-mer | ||
| b. 2-mer | ||
| c. 3-mer | ||
| d. Sequence length | ||
| e. GC content | ||
| 2. Amino Acid information: | ||
| a. Features of Wilkinson and Harrison [ | ||
| b. Features of Idicula-Thomas et al. [ | ||
| c. Isoelectric point | ||
| d. Peptide statistics | ||
| 3. Codon Adaptation Index | ||
| 4. PTMs | ||
| 12 | [ | 1. Molecular weight |
| 2. Cysteine fraction | ||
| 3. Hydrophobicity-related parameters: | ||
| a. Fraction of total number of hydrophobic amino acids | ||
| b. Fraction of largest number of contiguous hydrophobic/hydrophilic amino acids | ||
| 4. Aliphatic index | ||
| 5. Secondary structure-related properties: | ||
| a. Proline fraction | ||
| b. Alpha-helix propensity | ||
| c. Beta-sheet propensity | ||
| d. Turn-forming residue fraction | ||
| e. Alpha-helix propensity/beta-sheet propensity | ||
| 6. Protein–solvent interaction related parameters: | ||
| a. Hydrophilicity index | ||
| b. pI | ||
| c. Approximate charge average | ||
| 7. Fractions of: Alanine, Arginine, Asparagine, Aspartate, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Serine, Threonine, Tyrosine, Tryptophan and Valine | ||
| 13 | [ | 1. Frequencies of amino acid monomers, dimers and trimers using 7 different alphabets: |
| a. Monomer frequencies | ||
| i. [Natural-20:M] | ||
| ii. [ClustEM-17:M] | ||
| iii. [ClustEM-14:M] | ||
| iv. [PhysChem-7:M] | ||
| v. [BlosumSM-8:M] | ||
| vi. [ConfSimi-7:M] | ||
| vii. [Hydropho-5:M] | ||
| b. Dimer frequencies | ||
| i. [PhysChem-7:D] | ||
| ii. [ClustEM-14:D] | ||
| iii. [ClustEM-17:D] | ||
| iv. [BlosumSM-8:D] | ||
| v. [Natural-20:D] | ||
| vi. [ConfSimi-7:D] | ||
| c. Trimer frequencies | ||
| i. [ClustEM-17:T] | ||
| ii. [Hydropho-5:T] | ||
| iii. [ConfSimi-7:T] | ||
| iv. [ClustEM-14:T] | ||
| v. [Natural-20:T] | ||
| 2. Features computed directly: | ||
| a. Sequence length | ||
| b. Turn-forming residues fraction | ||
| c. Absolute charge per residue | ||
| d. Molecular weight | ||
| e. GRAVY index | ||
| f. Aliphatic index | ||
| 3. Predicted features using the SCRATCH suite of predictors: | ||
| a. Beta residues fraction (Predicted by SSpro) | ||
| b. Alpha residues fraction (Predicted by SSpro) | ||
| c. Number of domains (Predicted by DOMpro) | ||
| d. Exposed residues fraction (Predicted by ACCpro, using a 25% relative solvent accessibility cut-off) | ||
| 14 | [ | 1. Molecular weight |
| 2. Isoelectric point (pI) | ||
| 3. Ratios of each amino acid content | ||
| 15 | [ | 1. For mono-domain proteins: |
| a. Word size 1: | ||
| S, IL, M, F, DE, A, C, G, R | ||
| b. Word size 2: | ||
| R + R, R + C, R + E, R + T, N + Q, N + H, N + L, C + S, Q + A, Q + G, Q + I, E + A, E + G, E + K, E + P, E + V, G + P, H + M, L + Y, K + G, K + K, M + G, S + S, T + I, Y + C, Y + I | ||
| c. Word size 3: | ||
| ST + ST + ST, ST + ST + N, ST + DQE + AH, ST + C + ST, G + M + R, G + K + G, G + P + G, | ||
| G + P + N, M + AH + AH, M + C + Y, DQE + G + R, DQE + R + DQE, DQE + M + ST, | ||
| DQE + Y + N, DQE + AH + IV, K + R + IV, K + K + ST, P + DQE + DQE, P + DQE + C, | ||
| IV + G + IV, L + IV + DQE, N + FW + DQE, N + C + P, AH + ST + ST, AH + K + L, C + FW + Y, C + K + C | ||
| 2. For multi-domain proteins: | ||
| a. Word size 1: | ||
| R, D, C, E, G, L, K, M, S, W | ||
| b. Word size 2: | ||
| A + Y, A + V, R + N, R + E, R + S, R + Y, N + A, D + M, C + T, Q + A, Q + E, E + D, E + G, E + T, G + I, | ||
| G + F, G + S, H + C, H + M, H + P, L + G, L + S, K + D, K + G, K + L, K + F, P + L, T + L, T + Y, V + R | ||
| c. Word size 3: | ||
| ST + ST + ST, ST + P + DQE, ST + IV + K, R + DQE + FW, R + DQE + IV, R + IV + FW, | ||
| FW + DQE + FW, M + ST + DQE, M + G + AH, M + FW + DQE, DQE + ST + ST, | ||
| DQE + ST + G, DQE + G + K, DQE + IV + R, DQE + IV + L, P + G + ST, IV + ST + P, | ||
| L + K + FW, AH + ST + IV, AH + G + IV, AH + AH + M | ||
| 16 | [ | 1. Aliphatic index |
| 2. Frequency of occurrence of residues Cysteine (Cys), Glutamic acid (Glu), Asparagine (Asn) and Tyrosine (Tyr) | ||
| 3. Reduced class of conformational similarity [CMQLEKRA] | ||
| 4. Reduced classes of hydrophobicity [CFILMVW] and [NQSTY] | ||
| 5. Reduced classes of BLOSUM50 substitution matrix [CILMV] | ||
| 6. The 18 dipeptide composition: [VC], [AE], [VE], [WF], [YF], [AG], [FG], [WG], [HH], [MI], [HK], [KN], [KP], [ER], [YS], [RV], [KY], [TY] | ||
| 17 | [ | 1. Physicochemical properties (6 features): |
| a. Length of protein | ||
| b. Hydropathy index (GRAVY) | ||
| c. Aliphatic index | ||
| d. Instability index | ||
| e. Instability index of N-terminus | ||
| f. Net charge | ||
| 2. Mono-peptide frequencies (20 features) | ||
| 3. Dipeptide frequencies (400 features) | ||
| 4. Reduced alphabet set (20 features) | ||
| 18 | [ | 1. Aliphatic index (AI) |
| 2. Instability index of the N terminus | ||
| 3. Frequency of occurrence of Asn, Thr, and Tyr | ||
| 4. Tri-peptide score | ||
| 19 | [ | 1. Signal peptide |
| 2. GRAVY | ||
| 3. Transmembrane helices | ||
| 4. Number of Cysteines | ||
| 5. Anchor peptide | ||
| 6. Prokaryotic membrane lipoprotein lipid attachment site | ||
| 7. PDB identity | ||
| 20 | [ | 1. General sequence composition |
| 2. Clusters of orthologous groups (COG) assignment | ||
| 3. Length of hydrophobic stretches | ||
| 4. Number of low-complexity regions | ||
| 5. Number of interaction partners | ||
| 21 | [ | 1. Single residue composition: I, T, Y |
| 2. Combined amino acid compositions: KR, DE, DENQ | ||
| 3. Predicted secondary structure composition: α and coil | ||
| 4. Presence of signal sequence | ||
| 5. Amino acid sequence length | ||
| 6. Number of amino acids in both short and long low complexity regions (over sequence length) | ||
| 7. Normalized low complexity value for both short and long regions (over sequence length) | ||
| 8. Minimum GES hydrophobicity score calculated over all amino acids in a 20 residue sequence window | ||
| 22 | [ | 1. Hydrophobe |
| 2. Cplx: a measure of a short complexity region based on the SEG program. | ||
| 3. Gln composition | ||
| 4. Asp + Glu composition | ||
| 5. Ile-composition | ||
| 6. Phe + Tyr + Trp composition | ||
| 7. Gly + Ala + Val + Leu + Ile composition | ||
| 8. His + Lys + Arg composition | ||
| 9. Trp composition | ||
| 10. Alpha-helical secondary structure composition | ||
| 23 | [ | Same as row 24 (Reference [ |
| 24 | [ | 1. Charge average approximation (Asp, Glu, Lys and Arg) |
| 2. Turn-forming residue fraction (Asn, Gly, Pro and Ser) | ||
| 3. Cysteine fractions | ||
| 4. Proline fractions | ||
| 5. Hydrophilicity | ||
| 6. Molecular weight (Total number of residues) |
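Many of the feature sets listed above start from amino acid composition (20 features) and dipeptide composition (400 features). The following is a minimal sketch of how such features could be computed from a sequence; the function names and the toy sequence are illustrative assumptions and are not taken from any of the reviewed tools.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid (20 features)."""
    seq = seq.upper()
    # Only standard residues contribute to the denominator.
    n = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair (400 features)."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {pair: (c / total if total else 0.0) for pair, c in counts.items()}


# Hypothetical toy sequence, used only to illustrate the calls.
toy = "MKKLLEEG"
aac = amino_acid_composition(toy)
dpc = dipeptide_composition(toy)
print(len(aac), len(dpc))
```

Gapped dimers, reduced alphabets and physicochemical indices such as GRAVY would follow the same pattern but are not covered by this sketch.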
Databases/datasets used to predict protein solubility (in chronological order)
| 1 | Sd957 | [ | 957 | 285 | 672 | It is made from 3 previous datasets: Idicula-Thomas et al. [ | |
| 2 | PROSO II | [ | 82,000 | 41,000 | 41,000 | It is built from pepcDB and PDB and is the largest dataset to date. It is balanced. | |
| 3 | HGPD | [ | 17,821 (As of June 9th, 2011) | N/A | N/A | Human full-length cDNA. | |
| 4 | eSol | [ | 30,173 | N/A | N/A | A database on the solubility of entire ensemble of | |
| 5 | Solpro (SOLP) | [ | 17,408 | 8704 | 8704 | It is collected from 4 different sources: PDB, SwissProt, TargetDB and the dataset of “Idicula-Thomas, 2006”. Sequence redundancy is removed at a 25% sequence similarity threshold. It is balanced. | |
| 6 | PROSO | [ | 14,000 | 7000 | 7000 | It is collected by merging 4 datasets: TargetDB, | - |
| 7 | pepcDB | [ | N/A | N/A | N/A | It stored target and protocol information contributed by Protein Structure Initiative centres as well as targets imported from the TargetDB database. Now it has been replaced by TargetTrack. | |
| 8 | Idicula-Thomas 2006 | [ | 192 | 62 | 139 | It is collected from the literature. | - |
| 9 | Idicula-Thomas 2005 | [ | 174 | 41 | 133 | It is collected from the literature. | - |
| 10 | PDB | [ | 91,359 (As of 11 June 2013) | N/A | N/A | It is a repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. | |
| 11 | SPINE | [ | N/A | N/A | N/A | N/A | |
| 12 | TargetDB | [ | 295,041 (As of 29 March 2013) | N/A | N/A | It provided status information on target sequences and tracks their progress through the various stages of protein production and structure determination. Now it has been replaced by TargetTrack. | |
| 13 | TargetTrack | - | 316,424 (As of 14 June 2013) | N/A | N/A | It is a target registration database which provides information on the experimental progress and status of targets selected for structural determination by the Protein Structure Initiative and other worldwide high-throughput structural biology projects. | |
Description of feature selection methods used in machine learning [37]
| Filter | Filter methods evaluate the relatedness of features by looking at the inherent properties of the data. Usually a feature relevance score is computed, and the features with low scores are discarded. | Student’s t-test |
| Information gain [ | ||
| Gain ratio [ | ||
| Chi squared [N/A] | ||
| Symmetrical uncertainty [ | ||
| Unbalanced correlation score [ | ||
| Mann–Whitney test [ | ||
| Linear correlation coefficient [N/A] | ||
| Wrapper | In wrapper methods, various subsets of features are evaluated by training and testing a specific classification model, so a search algorithm is ‘wrapped’ around the classification model. This approach is tailored to a specific classification algorithm. | Sequential forward selection [ |
| Sequential backward elimination [ | ||
| Beam search [ | ||
| ReliefF [ | ||
| Embedded | Embedded methods build the search for an optimal subset of features into the classifier construction, so they are specific to a given learning algorithm. | Random forest [ |
| SVM recursive feature elimination (SvmRfe) [ | ||
| One attribute rule [ |
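The filter approach described in the table (score each feature, then discard low-scoring ones) can be sketched as follows. This sketch uses Welch's unequal-variance t-statistic as the relevance score, which is a swapped-in variant of the Student's t-test named in the table; the function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (assumed variant of the t-test)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # A feature that is constant in both classes carries no information.
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se


def filter_select(X, y, k):
    """Rank features by |t| between soluble (1) and insoluble (0), keep top k."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        soluble = [row[j] for row, label in zip(X, y) if label == 1]
        insoluble = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(soluble, insoluble)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: feature 0 separates the classes,
# feature 1 is noise, feature 2 is constant.
X = [
    [0.90, 0.2, 1.0],
    [0.80, 0.5, 1.0],
    [0.85, 0.3, 1.0],
    [0.10, 0.4, 1.0],
    [0.20, 0.1, 1.0],
    [0.15, 0.6, 1.0],
]
y = [1, 1, 1, 0, 0, 0]

selected, scores = filter_select(X, y, k=1)
print(selected)
```

Because the score is computed without training a classifier, the selection is independent of the downstream model, which is the property that distinguishes filter methods from wrapper and embedded methods.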
Performance measures used to evaluate protein solubility prediction (in alphabetical order)
| 1 | Accuracy | ACC | (TP + TN)/(TP + TN + FP + FN) | The number of correctly classified instances divided by the total number of instances [ |
| 2 | Area under ROC curve | AUC | - | It measures the discriminating ability of the model and it takes values between 0.5 for random drawing and 1.0 for perfect classifier [ |
| 3 | Enrichment Factor | EF | [CS/(CS + WS)]/[S/(S + I)] | EF is especially suitable for the unbalanced datasets [ |
| CS: Number of correctly classified soluble proteins. | ||||
| WS: Number of soluble proteins wrongly classified as insoluble. | ||||
| S: total number of soluble proteins. | ||||
| I: total number of insoluble proteins. | ||||
| 4 | False Negative | FN | - | The number of incorrectly predicted negatives [ |
| 5 | False Positive | FP | - | The number of incorrectly predicted positives [ |
| 6 | F-Score | FS | 2 × Precision × Recall/(Precision + Recall) | The harmonic mean of recall and precision [ |
| 7 | Gain | GAIN | Precision/proportion of the given class in the full data set. | It is an important performance measure that quantifies how much better the decision is in comparison with random drawing of instances [ |
| 8 | Matthews Correlation Coefficient | MCC | (TP × TN - FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | It indicates the correlation between the classifier assignments and the actual class in the two-class case. It is a good measure of classifier performance even when classes are unbalanced [ |
| 9 | Precision (Selectivity) | PRC | TP/(TP + FP) Or TN/(TN + FN) | The ratio of the number of correctly classified positive or negative instances to the number of all instances classified as positive or negative, for positive and negative class respectively [ |
| 10 | ROC Curve | ROC | Plotting the “TP-rate” against the “FP-rate” while the probability threshold is increased from 0 to 1.0 in 0.01 increments. | The receiver-operator characteristic curve, showing the trade-off between the ratio of false positives and false negatives in testing a classifier [ |
| 11 | Recall | REC | TP/(TP + FN) | The ratio of the number of correctly classified positive instances to the number of all instances from the positive class [ |
| (Sensitivity) | ||||
| (True positive rate) | ||||
| (TP-rate) | ||||
| 12 | Specificity | SPC | TN/(TN + FP) | The ratio of the number of correctly classified negative instances to the sum of all negative instances [ |
| (True Negative Rate) | ||||
| (TN-rate) | ||||
| 13 | True Positive | TP | - | The number of correctly predicted positives [ |
| 14 | True Negative | TN | - | The number of correctly predicted negatives [ |
a. “TP” = True Positive; “TN” = True Negative; “FP” = False Positive; “FN” = False Negative; “+” = Add; “-” = Subtract; “×” = Multiply; “/” = Divide.
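As an illustration of the confusion-matrix formulas in the table, the following minimal sketch computes ACC, PRC, REC, SPC, FS, MCC and GAIN for the positive (soluble) class. It uses the standard MCC definition with a square root in the denominator, and it assumes every denominator is non-zero; the function name and the toy counts are hypothetical and do not come from any reviewed study.

```python
from math import sqrt


def solubility_metrics(tp, tn, fp, fn):
    """Compute performance measures from confusion-matrix counts.

    Assumes all denominators are non-zero (no guard against empty classes).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # sensitivity, TP-rate
    specificity = tn / (tn + fp)      # TN-rate
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    positive_share = (tp + fn) / total  # proportion of the positive class
    return {
        "ACC": (tp + tn) / total,
        "PRC": precision,
        "REC": recall,
        "SPC": specificity,
        "FS": 2 * precision * recall / (precision + recall),
        "MCC": (tp * tn - fp * fn) / mcc_den,
        "GAIN": precision / positive_share,
    }


# Hypothetical counts for a balanced test set of 100 proteins.
m = solubility_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

On an unbalanced dataset, ACC can be high for a classifier that mostly predicts the majority class, which is why MCC and GAIN are reported alongside it.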