Apiwat Sangphukieo1,2, Teeraphan Laomettachit1, Marasri Ruengjitchatchawalya3,4,5.
Abstract
Identifying novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency. Genome neighborhood can provide additional information for identifying such proteins. We therefore expected that a computational approach, particularly machine learning (ML) with genome neighborhood-based features, would facilitate photosynthetic function assignment. Our results revealed a functional relationship between photosynthetic genes and their conserved neighboring genes, as measured by the 'Phylo score', indicating that their functions could be inferred from the genome neighborhood profile. We therefore created a new method for extracting patterns from the genome neighborhood network (GNN) and applied them to photosynthetic protein classification using ML algorithms. A random forest (RF) classifier using genome neighborhood-based features achieved the highest accuracy, up to 87%, in the classification of photosynthetic proteins and also performed better (Matthews correlation coefficient = 0.718) than other available tools, including a sequence similarity search (0.447) and an ML-based method (0.361). Furthermore, we demonstrated the ability of our model to identify novel photosynthetic proteins compared to the other methods. Our classifier is available at http://bicep2.kmutt.ac.th/photomod_standalone, https://bit.ly/2S0I2Ox and DockerHub: https://hub.docker.com/r/asangphukieo/photomod.
Year: 2020 PMID: 32346070 PMCID: PMC7189237 DOI: 10.1038/s41598-020-64053-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Protocol for building the photosynthetic protein classification model, with a demonstration of the feature extraction method. The protocol consists of dataset building, feature extraction, data preprocessing, classifier selection and feature selection. (A) In the feature extraction step, genome neighborhoods are called using intergenic distance criteria. Genes in the same homologous group are indicated by the same color label. The query genes conserved in four different genomes are labeled Genes 1-4. (B) Relationships between query genes and their neighbors are displayed in the GNN. A solid line represents a homologous relationship, while a dashed line represents a genome neighborhood relationship. The color of each node represents a homologous group, corresponding to the gene color in (A). (C) The genome neighborhood profile of each query gene is represented in table format. Each value in the table is a Phylo score, which represents the level of gene neighborhood conservation. The E-values (1E-10, 1E-50 and 1E-100) are the thresholds used to classify the proteins encoded by the genes into families.
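Panel (A) of Figure 1 can be illustrated with a minimal sketch of neighborhood calling by intergenic distance. It is not the authors' implementation: the `Gene` record, the `call_neighborhood` function, the same-strand requirement and the 300 bp `max_gap` default are all assumptions made for illustration, since the source does not state the actual criteria.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record: name, start, end (bp) and strand."""
    name: str
    start: int
    end: int
    strand: str


def call_neighborhood(genes: List[Gene], query: str, max_gap: int = 300) -> List[str]:
    """Return names of genes linked to `query` by a chain of adjacent,
    same-strand genes whose intergenic gaps are all <= max_gap.

    The 300 bp default is an assumed value, not taken from the source.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    idx = next(i for i, g in enumerate(ordered) if g.name == query)

    def linked(left: Gene, right: Gene) -> bool:
        # Two adjacent genes are linked if they share a strand and sit close together.
        return left.strand == right.strand and right.start - left.end <= max_gap

    neighbors = []
    i = idx
    while i > 0 and linked(ordered[i - 1], ordered[i]):  # walk upstream
        neighbors.append(ordered[i - 1].name)
        i -= 1
    i = idx
    while i < len(ordered) - 1 and linked(ordered[i], ordered[i + 1]):  # walk downstream
        neighbors.append(ordered[i + 1].name)
        i += 1
    return sorted(neighbors)


# Toy contig with made-up coordinates.
GENES = [
    Gene("geneA", 100, 500, "+"),
    Gene("geneB", 550, 900, "+"),
    Gene("geneC", 950, 1400, "+"),
    Gene("geneD", 2500, 3000, "+"),
    Gene("geneE", 3050, 3400, "-"),
]

print(call_neighborhood(GENES, "geneB"))
print(call_neighborhood(GENES, "geneD"))
```

In this toy contig, geneB is linked to geneA and geneC, while geneD has no neighbors because the gap to geneC is too large and geneE lies on the opposite strand.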
Figure 2. The relationship between conservation score (Phylo score, x-axis) and F1 measure (y-axis) for Prochlorococcus marinus. The blue line shows the F1 measure of the similarity between GO terms of photosynthetic genes and GO terms of their neighbors. The green line shows the F1 measure of the similarity between GO terms of photosynthetic genes and random GO terms. Coverage (right y-axis) represents the number of predicted proteins.
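One common way to score the similarity between two sets of GO terms with an F1 measure is to treat one set as the reference and the other as the prediction. The sketch below assumes that set-based definition; the source does not spell out the exact formula, and the function name `go_term_f1` and the example GO identifiers are illustrative only.

```python
from typing import Iterable


def go_term_f1(reference_terms: Iterable[str], predicted_terms: Iterable[str]) -> float:
    """Set-based F1 between reference GO terms and predicted GO terms.

    Assumed definition: precision = shared / predicted,
    recall = shared / reference, F1 = their harmonic mean.
    """
    reference, predicted = set(reference_terms), set(predicted_terms)
    shared = reference & predicted
    if not shared:
        return 0.0  # no overlap (also covers empty inputs)
    precision = len(shared) / len(predicted)
    recall = len(shared) / len(reference)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations for one query gene and its neighborhood.
gene_terms = {"GO:0015979", "GO:0009523", "GO:0016168"}
neighbor_terms = {"GO:0015979", "GO:0009523", "GO:0005524"}

print(round(go_term_f1(gene_terms, neighbor_terms), 3))
```

Two of the three terms are shared in each direction, so precision and recall are both 2/3 and the F1 measure is 2/3.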
Classification performance of different classifier algorithms and class balancing methods.
| Class balancing method | Classifier | Accuracy | F1 minor | MCC |
|---|---|---|---|---|
| No filter | BayesNet | 0.829 | 0.562 | 0.533 |
| No filter | RandomForest | 0.883 | 0.778 | 0.702 |
| No filter | SMO | 0.872 | 0.771 | 0.685 |
| SMOTE | BayesNet | 0.845 | 0.614 | 0.581 |
| SMOTE | RandomForest | 0.880 | 0.778 | 0.698 |
| SMOTE | SMO | 0.857 | 0.753 | 0.658 |
| ClassBalancer | BayesNet | 0.833 | 0.571 | 0.546 |
| ClassBalancer | RandomForest | 0.858 | 0.758 | 0.664 |
| ClassBalancer | SMO | 0.853 | 0.753 | 0.656 |
The number of selected features for each E-value criterion.
| E-value criteria | Number of selected features |
|---|---|
| 1E-10 | 1,156 (58%) |
| 1E-50 | 643 (32%) |
| 1E-100 | 201 (10%) |
| Total | 2,000 |
Performance comparison among different methods in the classification of photosynthetic proteins using two nested 10-fold cross-validation.
| Prediction methods | Accuracy | F1 minor | MCC |
|---|---|---|---|
| Blastp (sequence similarity search) | 0.818 ± 0.032 | 0.499 ± 0.080 | 0.447 ± 0.090 |
| SCMPSP (sequence-based model) | 0.768 ± 0.033 | 0.506 ± 0.060 | 0.361 ± 0.078 |
| PhotoMod (genome neighborhood-based model) | 0.874 ± 0.016* | 0.736 ± 0.043* | 0.718 ± 0.042* |
*Significant difference (P < 0.01) compared with the other methods, tested by Wilcoxon signed-rank test.
Performance comparison of different methods in the classification of novel photosynthetic proteins.
| Methods | TP | TN | FP | FN | Precision | Recall | Accuracy | F1 minor | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Blastp | 3 | 94 | 17 | 9 | 0.150 | 0.250 | 0.789 | 0.188 | 0.078 |
| SVMProt | 6 | 74 | 37 | 6 | 0.140 | 0.500 | 0.650 | 0.218 | 0.104 |
| DeepGO | 0 | 111 | 0 | 12 | — | — | — | — | — |
| SCMPSP | 3 | 76 | 35 | 9 | 0.079 | 0.250 | 0.642 | 0.120 | -0.042 |
| PhotoMod | 6 | 89 | 22 | 6 | 0.214 | 0.500 | 0.772 | 0.300 | 0.214 |
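The derived columns in this table follow from the four confusion-matrix counts using the standard definitions of precision, recall, accuracy, F1 and the Matthews correlation coefficient. The sketch below recomputes them for the PhotoMod row; the function name `confusion_metrics` is hypothetical, and returning `None` when a denominator is zero is a design choice of this sketch.

```python
import math
from typing import Dict, Optional


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, Optional[float]]:
    """Standard binary classification metrics from confusion-matrix counts.

    A metric is None when its denominator is zero (for example, precision
    when no positives are predicted).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    accuracy = (tp + tn) / total if total else None

    if precision is None or recall is None:
        f1 = None
    elif precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else None

    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


# Counts from the PhotoMod row of the table.
photomod = confusion_metrics(tp=6, tn=89, fp=22, fn=6)
print({name: round(value, 3) for name, value in photomod.items()})

# Counts from the DeepGO row: no positives predicted.
deepgo = confusion_metrics(tp=0, tn=111, fp=0, fn=12)
```

With no predicted positives, as in the DeepGO row, precision and MCC have zero denominators and are returned as `None`.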
Figure 3. Genome neighborhood networks of the novel photosynthetic genes used in this study. The protein clusters matched by the query sequences are represented by hexagons. The circles represent protein clusters from neighboring genes, and the edges show their gene neighborhood relationships with the query cluster. The size of each circle indicates the conservation level (corresponding to the Phylo score). The proteins are clustered with an E-value cutoff of 1E-10. The prediction results for these 12 proteins are shown in Supplementary Table S8.