Thales Francisco Mota Carvalho, José Cleydson F Silva, Iara Pinheiro Calil, Elizabeth Pacheco Batista Fontes, Fabio Ribeiro Cerqueira.
Abstract
Ribosomal proteins (RPs) play a fundamental role in all types of cells, as they are major components of ribosomes, which are essential for the translation of mRNAs. Furthermore, these proteins are involved in various physiological and pathological processes. The intrinsic biological relevance of RPs has motivated advanced studies to identify as-yet-unrevealed RPs. In this work, we propose a new computational method, termed Rama, for the prediction of RPs, based on machine learning techniques, with a particular interest in plants. To perform an effective classification, Rama uses a set of fundamental attributes of the amino acid side chains and applies a two-step procedure to classify proteins with unknown function as RPs. The evaluation of the resulting predictive models showed that Rama achieved mean sensitivity, precision, and specificity of 0.91, 0.91, and 0.82, respectively. Furthermore, a list of proteins that have no annotation in Phytozome v.10 but are annotated as RPs in Phytozome v.12 was correctly classified by our models. Additional computational experiments also showed that Rama differentiates ribosomal proteins from RNA-binding proteins with high accuracy. Finally, two novel proteins of Arabidopsis thaliana were validated in biological experiments. Rama is freely available at http://inctipp.bioagro.ufv.br:8080/Rama.
Year: 2017 PMID: 29176736 PMCID: PMC5701237 DOI: 10.1038/s41598-017-16322-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
IG values for the datasets composed of RPs and NRPs.
| Attribute | | | | | | |
|---|---|---|---|---|---|---|
| Aromatic | 0.0949 (4°) | 0.0509 (4°) | 0.0307 (8°) | 0.0249 (8°) | 0.0622 (5°) | 0.00987 (9°) |
| Hydrophobicity | 0.0554 (7°) | 0.0418 (7°) | 0.0623 (5°) | 0.0649 (5°) | 0.0670 (4°) | 0.05021 (5°) |
| Molecular mass | 0.0313 (8°) | 0.0299 (8°) | 0.0655 (4°) | 0.0439 (6°) | 0.0127 (8°) | 0.06119 (4°) |
| Negatively charged | 0.0741 (5°) | 0.0462 (6°) | 0.0579 (6°) | 0.1239 (3°) | 0.0457 (6°) | 0.02959 (7°) |
| Nonpolar aliphatic | 0.0000 (9°) | 0.0000 (9°) | 0.0210 (9°) | 0.0260 (7°) | 0.0000 (9°) | 0.02596 (8°) |
| Polar uncharged | 0.1196 (3°) | 0.1356 (3°) | 0.0468 (7°) | 0.0000 (9°) | 0.0980 (3°) | 0.03457 (6°) |
| Positively charged | 0.3450 (1°) | 0.3123 (1°) | 0.3284 (1°) | 0.3107 (1°) | 0.3076 (1°) | 0.26796 (1°) |
| Length | 0.2832 (2°) | 0.2365 (2°) | 0.2501 (2°) | 0.2245 (2°) | 0.1909 (2°) | 0.14375 (2°) |
| Volume | 0.0600 (6°) | 0.0500 (5°) | 0.0956 (3°) | 0.0676 (4°) | 0.0224 (7°) | 0.08408 (3°) |
The values were obtained by running the IG method on each training set. The rank derived from the IG values, shown in parentheses, indicates the importance of each attribute for each species.
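As a rough illustration of how a per-attribute score like those tabulated above can be computed, the sketch below assumes that IG denotes information gain: the reduction in class entropy after splitting on an attribute. The equal-width binning, the bin count, and the function names are illustrative assumptions; the discretization actually used to produce the table is not described in this section.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, n_bins=4):
    """Information gain of a numeric attribute after equal-width binning.

    The binning scheme is an assumption made for this sketch only.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant attribute
    bins = {}
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(y)
    n = len(labels)
    conditional = sum(len(ys) / n * entropy(ys) for ys in bins.values())
    return entropy(labels) - conditional
```

Ranking the nine attributes by this score, from largest to smallest, would give an ordering of the kind shown in parentheses in the table.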
IG values for datasets composed of RPs and HPs.
| Attribute | | | | | | |
|---|---|---|---|---|---|---|
| Aromatic | 0.0000 (8°) | 0.0733 (5°) | 0.0000 (6°) | 0.0000 (4°) | 0.0357 (4°) | 0.0431 (7°) |
| Hydrophobicity | 0.0495 (6°) | 0.0106 (9°) | 0.0193 (5°) | 0.0000 (4°) | 0.0000 (6°) | 0.0252 (9°) |
| Molecular mass | 0.0758 (5°) | 0.0607 (7°) | 0.0576 (4°) | 0.0000 (4°) | 0.0582 (3°) | 0.0961 (3°) |
| Negatively charged | 0.1046 (4°) | 0.0887 (4°) | 0.0000 (6°) | 0.0619 (3°) | 0.0000 (6°) | 0.0864 (4°) |
| Nonpolar aliphatic | 0.0000 (8°) | 0.0398 (8°) | 0.0000 (6°) | 0.0000 (4°) | 0.0000 (6°) | 0.0315 (8°) |
| Polar uncharged | 0.0324 (7°) | 0.0888 (3°) | 0.0000 (6°) | 0.0000 (4°) | 0.0251 (5°) | 0.0507 (6°) |
| Positively charged | 0.1218 (2°) | 0.1261 (2°) | 0.0945 (2°) | 0.0930 (2°) | 0.0000 (6°) | 0.1560 (1°) |
| Length | 0.1094 (3°) | 0.2398 (1°) | 0.1279 (1°) | 0.2072 (1°) | 0.1559 (1°) | 0.0526 (5°) |
| Volume | 0.1742 (1°) | 0.0654 (6°) | 0.0743 (3°) | 0.0000 (4°) | 0.0687 (2°) | 0.1268 (2°) |
The values were obtained by running the IG method on each training set. The rank derived from the IG values, shown in parentheses, indicates the importance of each attribute for each species.
Figure 1. Density plots for the attribute ‘Positively charged’ in A. thaliana datasets.
Summarized results of the classification models built with the RPs/NRPs datasets.
| Test species | Accuracy | Sensitivity | Precision | F-measure | Specificity | MCC |
|---|---|---|---|---|---|---|
| | 0.93 | 0.93 | 0.92 | 0.92 | 0.84 | 0.80 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.87 | 0.84 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.88 | 0.84 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.79 | 0.74 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.89 | 0.84 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.88 | 0.84 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.82 | 0.74 |
| | 0.92 | 0.92 | 0.91 | 0.91 | 0.83 | 0.78 |
| | 0.91 | 0.91 | 0.91 | 0.91 | 0.81 | 0.76 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.83 | 0.75 |
| | 0.92 | 0.92 | 0.92 | 0.92 | 0.83 | 0.79 |
| | 0.92 | 0.92 | 0.92 | 0.92 | 0.84 | 0.79 |
| | 0.90 | 0.90 | 0.89 | 0.89 | 0.79 | 0.72 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.79 | 0.73 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.80 | 0.74 |
| | 0.86 | 0.86 | 0.86 | 0.86 | 0.76 | 0.64 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.79 | 0.73 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.79 | 0.74 |
For each tested species, the first line refers to the mean values obtained in the inter-species tests, i.e., the tests where the ML models are trained with proteins of the other five species (see complete results in Supplementary Table S1), while the second and third lines present the results of a 10-fold cross validation and a jackknife test, respectively.
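For reference, the six reported measures can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions with RPs as the positive class; the function name and the zero-division fallback are illustrative choices. Whether the tabulated values are per-class figures or class-weighted averages is not stated in this section, so this is not necessarily the exact computation behind the table.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)   # recall on the positive class
    precision = safe_div(tp, tp + fp)
    specificity = safe_div(tn, tn + fp)   # recall on the negative class
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
        "specificity": specificity,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=90, fp=20, tn=80, fn=10)` gives an accuracy of 0.85, a sensitivity of 0.90, and a specificity of 0.80.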
Summarized results of the classification models built with the RPs/HPs datasets.
| Test species | Accuracy | Sensitivity | Precision | F-measure | Specificity | MCC |
|---|---|---|---|---|---|---|
| | 0.87 | 0.87 | 0.88 | 0.85 | 0.66 | 0.65 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.88 | 0.86 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.87 | 0.85 |
| | 0.89 | 0.89 | 0.89 | 0.89 | 0.69 | 0.65 |
| | 0.96 | 0.96 | 0.96 | 0.96 | 0.87 | 0.87 |
| | 0.96 | 0.96 | 0.96 | 0.96 | 0.87 | 0.87 |
| | 0.86 | 0.86 | 0.85 | 0.85 | 0.60 | 0.54 |
| | 0.89 | 0.89 | 0.89 | 0.89 | 0.73 | 0.66 |
| | 0.90 | 0.90 | 0.90 | 0.90 | 0.75 | 0.69 |
| | 0.91 | 0.91 | 0.91 | 0.91 | 0.73 | 0.69 |
| | 0.94 | 0.94 | 0.93 | 0.93 | 0.76 | 0.77 |
| | 0.94 | 0.94 | 0.94 | 0.93 | 0.77 | 0.78 |
| | 0.88 | 0.88 | 0.88 | 0.88 | 0.65 | 0.62 |
| | 0.91 | 0.91 | 0.91 | 0.90 | 0.68 | 0.70 |
| | 0.91 | 0.91 | 0.91 | 0.90 | 0.68 | 0.71 |
| | 0.85 | 0.85 | 0.84 | 0.83 | 0.59 | 0.52 |
| | 0.91 | 0.91 | 0.91 | 0.91 | 0.77 | 0.75 |
| | 0.92 | 0.92 | 0.91 | 0.91 | 0.78 | 0.75 |
For each tested species, the first line refers to the mean values obtained in the inter-species tests, i.e., the tests where the ML models are trained with proteins of the other five species (see complete results in Supplementary Table S2), while the second and third lines present the results of a 10-fold cross validation and a jackknife test, respectively.
Figure 2. Illustration of the Rama method using a sequence of A. thaliana as an example. The proposed approach is described in 12 parts. Parts A and B depict the built ML models. Parts C, D, E, F, and G comprise step 1 of Rama. Part C shows the input (amino acid sequence and species, e.g., A. thaliana) to the models for RPs/NRPs classification, according to the selected species. In D, the input protein is subjected to these models. In E, the probabilities generated by the models are combined according to Equation 9, producing either class 0 or class 1. In the case of class 0 (mean probability < 0.5), i.e., not a ribosomal protein (F), the program stops and reports the result. In the case of class 1 (mean probability ≥ 0.5), i.e., a ribosome-like protein (G), the same amino acid sequence is used as input to step 2 of the method, which is illustrated in parts H, I, J, K, and L. In H, the amino acid sequence is given as input to the classification models for RPs/HPs (I). In J, the probabilities of these models are aggregated to generate the final classification, which is either 0, a histone protein (K), or 1, a ribosomal protein (L). The default discriminant probability of 0.5 can be altered.
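The two-step decision described in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes the per-model probabilities are already available and that both steps aggregate them with a plain mean; Equation 9 is not reproduced in this section, so the aggregation is inferred from the caption's "mean probability" wording. The function name and the returned labels are hypothetical.

```python
from statistics import mean


def two_step_classify(step1_probs, step2_probs, threshold=0.5):
    """Cascade two ensembles of probabilities into one label.

    step1_probs: probabilities from the RPs/NRPs models (step 1).
    step2_probs: probabilities from the RPs/HPs models (step 2).
    threshold:   discriminant probability (0.5 by default, adjustable).
    """
    # Step 1: class 0 stops the procedure (parts E and F of the figure).
    if mean(step1_probs) < threshold:
        return "non-ribosomal"
    # Step 2: only ribosome-like proteins get here (parts G to L).
    if mean(step2_probs) < threshold:
        return "histone"
    return "ribosomal"
```

In a real pipeline the step 2 models would only be run for sequences that pass step 1; passing both probability lists up front simply keeps the sketch short.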
Results obtained by applying the whole pipeline of Rama.
| Test species | Accuracy | Sensitivity | Precision | F-measure | Specificity | MCC |
|---|---|---|---|---|---|---|
| | 0.93 | 0.93 | 0.93 | 0.93 | 0.85 | 0.82 |
| | 0.87 | 0.87 | 0.87 | 0.87 | 0.76 | 0.66 |
| | 0.93 | 0.93 | 0.93 | 0.93 | 0.85 | 0.81 |
| | 0.92 | 0.92 | 0.92 | 0.93 | 0.85 | 0.80 |
| | 0.92 | 0.92 | 0.92 | 0.92 | 0.83 | 0.80 |
| | 0.91 | 0.91 | 0.90 | 0.90 | 0.80 | 0.75 |
| Mean | 0.91 | 0.91 | 0.91 | 0.91 | 0.82 | 0.77 |
For each species, the values are a result of the application of the whole pipeline using the ML models built for both stages with sequences of the other five species.
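The evaluation protocol described here, in which each species is tested with models built from the other five, amounts to a leave-one-species-out split. The sketch below only generates the splits; the dictionary layout, species keys, and record values are placeholders, not data from the paper.

```python
def leave_one_species_out(datasets):
    """Yield (test_species, training_records, test_records) triples.

    datasets maps a species name to its list of labelled records.
    Each species is held out once; the rest are pooled for training.
    """
    for test_species, test_records in datasets.items():
        training_records = [
            record
            for species, records in datasets.items()
            if species != test_species
            for record in records
        ]
        yield test_species, training_records, test_records
```

With six species this produces six splits, one per row of the table above the Mean row.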
Figure 3. Nucleic acid binding activity of novel predicted RPs. HA fused-Histone H3, HA fused-ribosomal protein RPL10 (AT1G14320), and the novel predicted RPs AT3G51010 and AT4G11385 (also fused to HA) were in vitro transcribed/translated (input), incubated with sepharose bound-ssDNA, sepharose bound-dsDNA or A. thaliana streptavidin-beads bound-biotinylated RNA (bRNA), pulled-down, separated by SDS-PAGE, and immunoblotted with anti-HA serum. Input shows an immunoblot of the in vitro transcribed/translated HA fusions. The sizes and positions of protein molecular mass markers are shown on the left.
Training set composed of RPs and NRPs. Number of RPs and NRPs that make up the training set for each species.
| Training set | Number of RPs | Number of NRPs |
|---|---|---|
| | 516 | 1548 |
| | 1085 | 3255 |
| | 539 | 1617 |
| | 176 | 528 |
| | 485 | 1455 |
| | 1408 | 4224 |
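Every row of this table has exactly three NRPs per RP (for example, 516 RPs and 1548 NRPs). How the NRPs were selected is not stated in this section; the sketch below assumes simple random sampling without replacement from a larger pool of candidate negatives, with a fixed seed for reproducibility. The function and parameter names are hypothetical.

```python
import random


def sample_negatives(positives, candidates, ratio=3, seed=0):
    """Draw ratio * len(positives) negatives from a candidate pool.

    Random sampling is an assumption of this sketch, not a documented
    step of the original method.
    """
    needed = ratio * len(positives)
    if needed > len(candidates):
        raise ValueError("not enough candidate negatives for the requested ratio")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(candidates, needed)
```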
Training set composed of RPs and HPs. Number of RPs and HPs that make up the training set for each species.
| Training set | Number of RPs | Number of HPs |
|---|---|---|
| | 516 | 182 |
| | 1085 | 269 |
| | 539 | 112 |
| | 176 | 46 |
| | 485 | 119 |
| | 1408 | 395 |
Description of the classification attributes and the type of amino acids used to calculate them.
| Attribute | Attribute description | Amino acid types used to calculate the attribute value |
|---|---|---|
| Aromatic | Proportion of aromatic amino acids | F, Y, and W |
| Negatively charged | Proportion of negatively-charged amino acids | D and E |
| Nonpolar aliphatic | Proportion of nonpolar-aliphatic amino acids | G, A, P, V, L, I, and M |
| Polar uncharged | Proportion of polar-uncharged amino acids | S, T, C, N, and Q |
| Positively charged | Proportion of positively-charged amino acids | K, H, and R |
| Hydrophobicity | Average hydropathy index of amino acids in the sequence | All 20 amino acids |
| Molecular mass | Average mass of amino acids in the sequence | All 20 amino acids |
| Volume | Average volume of amino acids in the sequence | All 20 amino acids |
| Length | Total number of amino acids in the sequence | All 20 amino acids |
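The attributes in this table can be computed directly from an amino acid sequence. The sketch below covers the five proportions, the length, and the hydrophobicity average. The residue groups are taken from the table; the use of the Kyte-Doolittle hydropathy scale is an assumption, since the table names only a "hydropathy index". Molecular mass and volume would follow the same averaging pattern with a per-residue scale of choice and are omitted here. All names in the code are illustrative.

```python
# Residue groups as listed in the attribute table.
GROUPS = {
    "aromatic": set("FYW"),
    "negatively_charged": set("DE"),
    "nonpolar_aliphatic": set("GAPVLIM"),
    "polar_uncharged": set("STCNQ"),
    "positively_charged": set("KHR"),
}

# Kyte-Doolittle hydropathy index (assumed scale, not confirmed by the source).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_attributes(seq):
    """Return group proportions, mean hydropathy, and length of a sequence.

    Only the 20 standard one-letter codes are supported; any other
    symbol raises KeyError when the hydropathy average is computed.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    attrs = {
        name: sum(aa in members for aa in seq) / n
        for name, members in GROUPS.items()
    }
    attrs["hydrophobicity"] = sum(KYTE_DOOLITTLE[aa] for aa in seq) / n
    attrs["length"] = n
    return attrs
```

For the toy sequence `KKDEF`, two of five residues are positively charged and two are negatively charged, so both proportions come out as 0.4.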