Rakesh Kaundal, Sitanshu S Sahu, Ruchi Verma, Tyler Weirick.
Abstract
BACKGROUND: Plastids are an important component of plant cells. They are the site of manufacture and storage of chemical compounds used by the cell, and they contain pigments such as those involved in photosynthesis, starch synthesis/storage, and cell color. They are essential organelles of the plant cell and are also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day, and the predicted proteomes of these genomes need to be annotated at a faster pace. One such annotation need is an automated system that can accurately distinguish plastid from non-plastid proteins and further classify plastid types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid proteins and found significant differences, which we used as the basis for developing various feature-based prediction models using similarity search and machine learning.
Year: 2013 PMID: 24266945 PMCID: PMC3851450 DOI: 10.1186/1471-2105-14-S14-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Plastid and its various types with their respective organelle functions.
Number of protein sequences for the plastid and non-plastid classes used in phase-I (identification) training/testing
| Type | Available | < 30% cutoff | < 30% cutoff (across class) | 10% independent test set | Training set |
|---|---|---|---|---|---|
| Plastids | 17514 | 3535 | 3160 | 316 | 2844 |
| Non-plastids | 17514 | 3191 | 3160 | 316 | 2844 |
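The 10% hold-out in the table above can be illustrated with a short sketch. The source does not say how the independent test sequences were selected, so the random shuffle, the seed, and the function name `holdout_split` are assumptions made purely for illustration.

```python
import random


def holdout_split(items, test_fraction=0.10, seed=0):
    """Shuffle `items` and set aside `test_fraction` of them as a test set.

    Returns (training_items, test_items). Random selection with a fixed
    seed is an assumption; the source only reports the resulting counts.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 3160 sequences per class after redundancy reduction (from the table).
train_ids, test_ids = holdout_split(range(3160))
print(len(train_ids), len(test_ids))
```

With 3160 sequences per class this yields 2844 training and 316 test sequences, matching the counts in the table.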
Number of sequences available for plastid types in various online databases
| Plastid type | UniProt | NCBI | PLprot | PPDB |
|---|---|---|---|---|
| Chloroplast | 15203 | 47346 | 690 | 2115 |
| Chromoplast | 75 | 91 | 143 | 11 |
| Etioplast | 56 | 21 | 240 | 0 |
| Amyloplast | 78 | 106 | 0 | 0 |
| Leucoplast | 2 | 3 | 0 | 0 |
| Elaioplast | 1 | 1 | 0 | 0 |
| Proteinoplast | 0 | 0 | 0 | 0 |
Number of protein sequences for various plastid types used in phase-II (classification) training/testing
| Plastid type | Available | < 30% cutoff | 10% independent test set | Training set |
|---|---|---|---|---|
| Chloroplast | 690 | 602 | 60 | 542 |
| Chromoplast | 220 | 194 | 17 | 177 |
| Etioplast | 270 | 244 | 24 | 220 |
| Amyloplast | 313 | 255 | 23 | 232 |
Physicochemical properties used to represent a protein for predicting plastids and their types using SVM.
| No. | Physicochemical property | Amino acids | # features |
|---|---|---|---|
| 1 | Charged residues | D, R, E, K, H | 1 |
| 2 | Hydrophilic (polar) and neutral | N, Q, S, T, Y | 1 |
| 3 | Basic polar or Positively charged | H, K, R | 1 |
| 4 | Acidic polar or Negatively charged | D, E | 1 |
| 5 | Aliphatic | A, G, I, L, V | 1 |
| 6 | Aromatic | F, W, Y | 1 |
| 7 | Small | T, D, N | 1 |
| 8 | Tiny | G, A, S, P | 1 |
| 9 | Large | F, R, W, Y | 1 |
| 10 | Hydrophobic (non-polar) and aromatic | W, F | 1 |
| 11 | Hydrophobic (non-polar) and neutral | A, C, G, I, L, M, F, P, W, V | 1 |
| 12 | Amidic ( | N, Q | 1 |
| 13 | Cyclic | P | 1 |
| 14 | Hydroxylic | S, T | 1 |
| 15 | Sulfur-containing | C, M | 1 |
| 16 | H-bonding | C, W, N, Q, S, T, Y, K, R, H, D, E | 1 |
| 17 | Acidic and their Amide | D, E, N, Q | 1 |
| 18 | Ionizable | D, E, H, C, Y, K, R | 1 |
| 19 | Forms covalent cross-link (disulfide bond) | C | 1 |
| 20 | Theoretical pI (isoelectric point) | - | 1 |
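As a rough illustration of how such residue groups can be turned into numeric features, the sketch below computes, for a few of the groups listed in the table, the fraction of residues in a sequence that belong to each group. The source states only that each property contributes one feature; encoding it as a residue fraction is an assumption, and the theoretical pI (property 20) is not covered here. The names `GROUPS` and `group_fractions` are hypothetical.

```python
# A few residue groups copied from the table above (not the full list).
GROUPS = {
    "charged": "DREKH",
    "aliphatic": "AGILV",
    "aromatic": "FWY",
    "sulfur_containing": "CM",
}


def group_fractions(sequence, groups=GROUPS):
    """Return, for each group, the fraction of residues that belong to it.

    Assumption: one feature per property, encoded as a fraction in [0, 1].
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        name: sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in groups.items()
    }


features = group_fractions("MKDEAFWC")
print(features)
```

Extending `GROUPS` with the remaining rows of the table would give one value per property for each protein.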
Overall performance of homology-based (PSI-BLAST) prediction for the identification of plastid vs. non-plastid proteins and the classification of diverse plastid-types.
| Class | No. of sequences | H | C | P (%) | A (%) |
|---|---|---|---|---|---|
| | 2844 | 2731 | 1443 | 52.84 | 50.74 |
| | 2844 | 2726 | 2337 | 85.73 | 82.17 |
| Chloroplast | 542 | 483 | 167 | 34.58 | 30.81 |
| Chromoplast | 177 | 172 | 17 | 9.88 | 9.61 |
| Etioplast | 220 | 204 | 4 | 1.96 | 1.82 |
| Amyloplast | 232 | 219 | 42 | 19.18 | 18.10 |
*at e-value = 0.001; H = Number of total hits; C = Number of correct or true hits; P = Percent of correct hits calculated as (C/H*100); A = Percent accuracy calculated as (C/total number of proteins in a particular class * 100).
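The two percentages defined in the footnote can be checked directly from the counts in the table. The sketch below applies the stated formulas, P = C/H × 100 and A = C/total × 100; the function name `hit_stats` is hypothetical.

```python
def hit_stats(total, hits, correct):
    """Return (P, A) as defined in the table footnote.

    P = percent of correct hits        = correct / hits  * 100
    A = percent accuracy for the class = correct / total * 100
    """
    p = correct / hits * 100
    a = correct / total * 100
    return round(p, 2), round(a, 2)


# Counts taken from the first two rows of the table.
print(hit_stats(2844, 2731, 1443))
print(hit_stats(2844, 2726, 2337))
```

Because H is never larger than the class size here, A is always at most P: sequences with no hit at the e-value cutoff count against A but not against P.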
Figure 2. A comparative bar graph of amino acid composition differences between plastid and non-plastid proteins.
Overall performance of various feature classifiers in 5-fold cross-validation for the identification of plastid vs. non-plastid proteins (phase-I)
| Feature type | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | Precision (%) | RFP (%) | SVM kernel type |
|---|---|---|---|---|---|---|---|
| AAC | 85.37 | 85.65 | 85.51 | 0.71 | 85.61 | 14.39 | RBF ( |
| PseAA | 89.45 | 82.95 | 86.20 | 0.73 | 83.99 | 16.01 | RBF ( |
| Dipep | 88.08 | 85.51 | 86.80 | 0.74 | 85.88 | 14.12 | RBF ( |
| NCC | 84.14 | 89.66 | 86.90 | 0.74 | 89.06 | 10.94 | RBF ( |
| Phys-Chem | 79.57 | 81.05 | 80.31 | 0.61 | 80.76 | 19.24 | RBF ( |
*best values obtained at ≥ 0.0 threshold, individual performance of these classifiers can be seen in supplementary material; AAC = amino acid composition, PseAA = Pseudo amino acid composition, Dipep = Dipeptide composition, NCC = Nterminal-Center-Cterminal composition (sequence divided into 3 parts), Phys-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, RFP = Rate of False Predictions, RBF = Radial Basis Function of SVM.
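To make the composition-based feature types in the footnote concrete, the sketch below computes AAC (20 values), dipeptide composition (400 values), and NCC (60 values) for a single sequence. The source says only that NCC divides the sequence into three parts; splitting into near-equal thirds, expressing values as percentages, and the function names `aac`, `dipep`, and `ncc` are assumptions. PseAA and the physicochemical features are not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: percentage of each of the 20 residues."""
    n = len(seq)
    return [100.0 * seq.count(a) / n for a in AMINO_ACIDS]


def dipep(seq):
    """Dipeptide composition: percentage of each ordered residue pair
    among all overlapping pairs in the sequence (400 values)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return [100.0 * pairs.count(a + b) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


def ncc(seq):
    """N-terminal/center/C-terminal composition: AAC of three parts.

    Assumption: the parts are near-equal thirds, with any remainder
    assigned to the center part.
    """
    if len(seq) < 3:
        raise ValueError("sequence too short to split into three parts")
    third = len(seq) // 3
    parts = (seq[:third], seq[third:len(seq) - third], seq[len(seq) - third:])
    return [value for part in parts for value in aac(part)]


example = "ACDACD"
print(len(aac(example)), len(dipep(example)), len(ncc(example)))
```

Non-standard residue codes (such as X) are counted in the sequence length but in none of the 20 categories, so a real pipeline would need to decide how to handle them.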
Figure 3. ROC curves for all five classifiers (AAC, PseAAC, DIPEP, NCC, PhysicoChem) in phase-I prediction, i.e. identification of plastid vs. non-plastid proteins. AUC = Area Under Curve, AAC = amino acid composition, PseAAC = pseudo amino acid composition, DIPEP = dipeptide composition, NCC = Nterminal-Center-Cterminal composition, PhysicoChem = Protein physicochemical properties.
Overall performance of various feature classifiers on an 'independent test' dataset for the identification of plastid vs. non-plastid proteins (phase-I)
| Feature type | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | Precision (%) | RFP (%) |
|---|---|---|---|---|---|---|
| AAC | 69.30 | 87.03 | 78.16 | 0.57 | 84.23 | 15.77 |
| PseAA | 68.35 | 87.34 | 77.85 | 0.57 | 84.38 | 15.62 |
| Dipep | 60.44 | 92.72 | 76.58 | 0.56 | 89.25 | 10.75 |
| NCC | 65.82 | 87.97 | 76.90 | 0.55 | 84.55 | 15.45 |
| Phys-Chem | 68.35 | 84.49 | 76.42 | 0.54 | 81.51 | 18.49 |
*individual performance of these classifiers can be seen in supplementary material; AAC = amino acid composition (best values at ≥ 0.0 threshold), PseAA = Pseudo amino acid composition (best values at ≥ 0.1 threshold), Dipep = Dipeptide composition (best values at ≥ 0.2 threshold), NCC = Nterminal-Center-Cterminal composition (sequence divided into 3 parts), Phys-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, RFP = Rate of False Predictions.
Figure 4. A comparative bar graph of amino acid composition differences among various plastid types: amyloplast, chromoplast, chloroplast and etioplast proteins.
Overall performance of various feature classifiers in 5-fold cross-validation for the classification of diverse plastid-types* (phase-II)
| Feature type | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | Precision (%) | ER (%) | SVM kernel type |
|---|---|---|---|---|---|---|---|
| AAC | 60.03 | 76.05 | 77.45 | 0.40 | 59.00 | 22.55 | RBF ( |
| PseAA | 60.72 | 77.13 | 78.01 | 0.41 | 59.55 | 21.99 | RBF ( |
| Dipep | 62.26 | 75.85 | 78.60 | 0.44 | 62.62 | 21.40 | RBF ( |
| NCC | 60.97 | 77.34 | 78.39 | 0.42 | 58.51 | 21.61 | RBF ( |
| Phys-Chem | 56.70 | 78.01 | 76.56 | 0.36 | 54.15 | 23.44 | RBF ( |
*classification of 4 plastid types: chloroplast, chromoplast, etioplast, amyloplast; individual performance of these classifiers on each class can be seen in supplementary material; AAC = amino acid composition, PseAA = Pseudo amino acid composition, Dipep = Dipeptide composition, NCC = Nterminal-Center-Cterminal composition (sequence divided into 3 parts), Phys-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, ER = Error Rate, RBF = Radial Basis Function of SVM.
Figure 5. ROC curves for the best classifier (Dipeptide composition-based) in phase-II prediction, i.e. classification of various plastid types (chloroplast, chromoplast, etioplast, amyloplast). Values in parentheses represent Area Under Curve (AUC).
Overall performance of various feature classifiers on an 'independent test' dataset for the classification of diverse plastid-types* (phase-II)
| Feature type | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | Precision (%) | ER (%) |
|---|---|---|---|---|---|---|
| AAC | 57.26 | 63.89 | 72.54 | 0.30 | 62.45 | 27.47 |
| PseAA | 57.26 | 63.88 | 72.48 | 0.31 | 65.25 | 27.52 |
| Dipep | 61.29 | 65.96 | 74.97 | 0.40 | 73.97 | 25.03 |
| NCC | 61.29 | 75.82 | 77.15 | 0.40 | 60.42 | 22.85 |
| Phys-Chem | 45.97 | 65.30 | 66.63 | 0.14 | 47.03 | 33.37 |
*classification of 4 plastid types: chloroplast, chromoplast, etioplast, amyloplast; individual performance of these classifiers on each class can be seen in supplementary material; AAC = amino acid composition, PseAA = Pseudo amino acid composition, Dipep = Dipeptide composition, NCC = Nterminal-Center-Cterminal composition (sequence divided into 3 parts), Physico-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, ER = Error Rate.
Overall performance comparison of our method with the existing web tools for predicting plastid proteins.
| Tools | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | Precision (%) | RFP (%) |
|---|---|---|---|---|---|---|
| | 56.96 | 74.76 | 65.82 | 0.3223 | 69.50 | 30.50 |
| | 55.70 | 85.89 | 65.97 | 0.3998 | 88.44 | 11.56 |
| | 36.39 | 98.42 | 67.41 | 0.4438 | 95.83 | 4.17 |
| | 34.81 | 97.47 | 66.14 | 0.4142 | 93.22 | 6.78 |
| DIPEP | 60.44 | 92.72 | 76.58 | 0.56 | 89.25 | 10.75 |
| NCC | 65.82 | 87.97 | 76.90 | 0.55 | 84.55 | 15.45 |
Performance comparison done on an 'independent dataset' that contains 316 plastid and 316 non-plastid proteins. MCC = Matthews Correlation Coefficient, RFP = Rate of False Predictions, DIPEP = Dipeptide composition-based classifier, NCC = Nterminal-Center-Cterminal composition-based classifier.
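The reported measures can be reproduced from a 2 × 2 confusion matrix. The sketch below assumes the standard definitions of sensitivity, specificity, accuracy, precision and MCC, and assumes RFP = FP / (TP + FP) × 100, which is consistent with every row of the tables, where RFP equals 100 minus precision. The counts used in the example (TP = 191, FN = 125, TN = 293, FP = 23) are not given in the source; they are back-calculated from the row reporting 60.44% sensitivity and 92.72% specificity on 316 plastid and 316 non-plastid proteins. The function name `binary_metrics` is hypothetical.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Compute the table's performance measures from confusion-matrix counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        "precision": 100.0 * tp / (tp + fp),
        # Assumed definition: share of positive predictions that are wrong.
        "rfp": 100.0 * fp / (tp + fp),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Back-calculated counts for a 316 + 316 independent test set (assumption).
metrics = binary_metrics(tp=191, fn=125, tn=293, fp=23)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these counts give 60.44, 92.72, 76.58, 89.25, 10.75 and an MCC of 0.56, which agrees with that row of the table.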