Dong Chen¹, Sai Li², Yu Chen².
Abstract
Sucrose transporters (SUTs) are transmembrane proteins that occur widely in plants and play a significant role in sucrose transport and in sucrose-specific signal sensing. Identifying sucrose transporters is therefore important for studying seed development, flowering, and plant growth. In this study, a random forest-based model named ISTRF was proposed to identify sucrose transporters. First, a dataset containing 382 SUT proteins and 911 non-SUT proteins was constructed from the UniProt and PFAM databases. Second, k-separated-bigrams-PSSM was used to represent each protein sequence. Third, the Borderline-SMOTE algorithm was applied to counter the effect of class imbalance in the training data on identification performance. Finally, the random forest algorithm was used to train the identification model. The 10-fold cross-validation results showed that k-separated-bigrams-PSSM was the most discriminative feature for identifying sucrose transporters and that Borderline-SMOTE improved the performance of the identification model. Furthermore, random forest was superior to the other classifiers on almost all indicators. Compared with other identification models, ISTRF achieved the best overall performance and markedly improved the identification of sucrose transporter proteins.
Keywords: biological sequence analysis; machine learning; protein identification; random forest; sucrose transporter
Year: 2022 PMID: 36171889 PMCID: PMC9511101 DOI: 10.3389/fgene.2022.1012828
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
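The abstract names k-separated-bigrams-PSSM as the sequence representation but does not define it here. As a rough illustration, the sketch below assumes the commonly used formulation: a protein of length L is described by an L x 20 PSSM (taken to be already normalized to probabilities), and each of the 400 features sums the product of column m at position i and column n at position i + k over all valid i. The function name, the toy matrix, and the values of k are hypothetical and are not taken from ISTRF.

```python
from typing import List


def k_separated_bigrams(pssm: List[List[float]], k: int = 1) -> List[float]:
    """Return a 400-dimensional k-separated-bigram vector for an L x 20 matrix.

    Feature (m, n) is the sum over i of pssm[i][m] * pssm[i + k][n].
    Assumed formulation; not confirmed by the source text.
    """
    n_cols = 20
    if any(len(row) != n_cols for row in pssm):
        raise ValueError("each PSSM row must have 20 columns")
    features = [0.0] * (n_cols * n_cols)
    # Positions i and i + k must both exist, so i stops at L - k - 1.
    for i in range(len(pssm) - k):
        row_a, row_b = pssm[i], pssm[i + k]
        for m in range(n_cols):
            for n in range(n_cols):
                features[m * n_cols + n] += row_a[m] * row_b[n]
    return features


# Hypothetical 3-residue matrix: each row puts all probability on one column.
toy = [[0.0] * 20 for _ in range(3)]
toy[0][0] = 1.0  # position 0 -> column 0
toy[1][1] = 1.0  # position 1 -> column 1
toy[2][0] = 1.0  # position 2 -> column 0

vec_k1 = k_separated_bigrams(toy, k=1)  # adjacent positions
vec_k2 = k_separated_bigrams(toy, k=2)  # positions two apart
```

Vectors computed for several values of k can be concatenated into one feature vector; which values ISTRF uses is not stated in this excerpt.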
FIGURE 1. Frame chart of ISTRF.
Self-built dataset.
| Dataset | SUT | Non-SUT |
|---|---|---|
| Training dataset | 306 | 729 |
| Testing dataset | 76 | 182 |
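The training split above holds 306 SUT against 729 non-SUT proteins, which is the imbalance that Borderline-SMOTE is meant to address. The following is a simplified sketch of the Borderline-SMOTE1 idea, not the authors' implementation: minority samples whose neighbourhood is mostly, but not entirely, majority class are treated as borderline, and new samples are interpolated between them and nearby minority samples. The function names, the parameters `m`, `k`, `n_new`, and the toy 2-D data are all hypothetical; in practice a library implementation such as `BorderlineSMOTE` from imbalanced-learn would normally be used.

```python
import math
import random
from typing import List, Sequence


def nearest(point: Sequence[float], pool: List[Sequence[float]],
            count: int, skip: int) -> List[int]:
    """Indices of the `count` pool entries closest to `point`, ignoring `skip`."""
    candidates = [i for i in range(len(pool)) if i != skip]
    candidates.sort(key=lambda i: math.dist(point, pool[i]))
    return candidates[:count]


def borderline_smote(X, y, minority=1, m=5, k=5, n_new=None, seed=0):
    """Return synthetic minority samples built from borderline points only."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    if n_new is None:
        # Default: generate enough samples to balance the two classes.
        n_new = max(len(y) - 2 * len(min_idx), 0)

    # Step 1: a minority sample is "in danger" when at least half, but not
    # all, of its m nearest neighbours belong to the majority class.
    danger = []
    for i in min_idx:
        neigh = nearest(X[i], X, m, skip=i)
        n_major = sum(1 for j in neigh if y[j] != minority)
        if len(neigh) / 2 <= n_major < len(neigh):
            danger.append(i)
    if not danger or len(min_idx) < 2:
        return []

    # Step 2: interpolate between a randomly chosen danger sample and one of
    # its k nearest minority neighbours (simplification of the per-sample loop).
    minority_pts = [X[i] for i in min_idx]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        neigh = nearest(X[i], minority_pts, k, skip=min_idx.index(i))
        j = rng.choice(neigh)
        gap = rng.random()
        synthetic.append([a + gap * (b - a)
                          for a, b in zip(X[i], minority_pts[j])])
    return synthetic


# Hypothetical toy data: six majority points (label 0), four minority (label 1).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [1.5, 0.5],
     [2, 0.5], [3, 0.5], [3.5, 0.5], [4, 0.5]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
synthetic = borderline_smote(X, y, minority=1, m=3, k=2)
```

In this toy set only the minority point at (2, 0.5) sits next to the majority cluster, so every synthetic sample lies on the segment between it and its two nearest minority neighbours.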
Results of various feature extraction methods using random forest without Borderline-SMOTE on 10-fold cross-validation.
| Feature | SN | SP | ACC | MCC | F-measure |
|---|---|---|---|---|---|
| 188D | 0.895 | 0.970 | 0.948 | 0.874 | 0.910 |
| PSSM composition | 0.876 | 0.967 | 0.940 | 0.855 | 0.896 |
| k-separated-bigrams-PSSM | **0.925** | 0.973 | **0.958** | **0.900** | **0.929** |
| 188D + PSSM composition | 0.895 | 0.973 | 0.950 | 0.878 | 0.913 |
| 188D + k-separated-bigrams-PSSM | **0.925** | 0.973 | **0.958** | **0.900** | **0.929** |
| PSSM composition + k-separated-bigrams-PSSM | 0.908 | **0.978** | 0.957 | 0.897 | 0.927 |
| 188D + PSSM composition + k-separated-bigrams-PSSM | 0.918 | 0.973 | 0.957 | 0.895 | 0.926 |
Bold values in the table indicate the best results.
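The cross-validation scores in these tables come from 10-fold splits of the training set. A minimal sketch of how such folds can be built is given below, assuming stratified assignment so that each fold keeps roughly the 306:729 class ratio of the training data. Stratification, the helper name `stratified_folds`, and the seed are assumptions rather than details confirmed by the source.

```python
import random
from typing import List


def stratified_folds(labels: List[int], n_folds: int = 10,
                     seed: int = 0) -> List[List[int]]:
    """Split sample indices into n_folds folds with similar class ratios."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


# Label vector matching the training split: 306 SUT (1) and 729 non-SUT (0).
labels = [1] * 306 + [0] * 729
folds = stratified_folds(labels)

for held_out in folds:
    held = set(held_out)
    train_idx = [i for i in range(len(labels)) if i not in held]
    # A model would be fitted on train_idx and scored on held_out here.
```

Each fold serves once as the held-out part, and the reported value of each metric is then typically the average over the ten folds.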
Results of various feature extraction methods using SGD without Borderline-SMOTE on 10-fold cross-validation.
| Feature | SN | SP | ACC | MCC | F-measure |
|---|---|---|---|---|---|
| 188D | 0.866 | 0.951 | 0.926 | 0.821 | 0.873 |
| PSSM composition | 0.873 | 0.956 | 0.931 | 0.834 | 0.883 |
| k-separated-bigrams-PSSM | **0.964** | 0.952 | **0.956** | **0.897** | **0.928** |
| 188D + PSSM composition | 0.902 | 0.959 | 0.942 | 0.861 | 0.902 |
| 188D + k-separated-bigrams-PSSM | 0.912 | 0.952 | 0.940 | 0.857 | 0.900 |
| PSSM composition + k-separated-bigrams-PSSM | 0.905 | **0.967** | 0.949 | 0.877 | 0.913 |
| 188D + PSSM composition + k-separated-bigrams-PSSM | 0.912 | 0.960 | 0.946 | 0.870 | 0.909 |
Bold values in the table indicate the best results.
Results of various features using random forest with Borderline-SMOTE on 10-fold cross-validation. Values in parentheses give the change, in percentage points, relative to the same feature without Borderline-SMOTE.
| Feature | SN | SP | ACC | MCC | F-measure |
|---|---|---|---|---|---|
| 188D | 0.989 (+9.4) | 0.937 (-3.3) | 0.963 (+1.5) | 0.927 (+5.3) | 0.964 (+5.4) |
| PSSM composition | 0.982 (+10.6) | 0.952 (-1.5) | 0.967 (+2.7) | 0.935 (+8) | 0.968 (+7.2) |
| k-separated-bigrams-PSSM | 0.986 (+6.1) | 0.970 (-0.3) | 0.978 (+2) | 0.956 (+5.6) | 0.978 (+4.9) |
| 188D + PSSM composition | 0.982 (+8.7) | 0.952 (-2.1) | 0.967 (+1.7) | 0.935 (+5.7) | 0.968 (+5.5) |
| 188D + k-separated-bigrams-PSSM | 0.984 (+5.9) | 0.945 (-2.8) | 0.964 (+0.6) | 0.929 (+2.9) | 0.965 (+3.6) |
| PSSM composition + k-separated-bigrams-PSSM | 0.984 (+7.6) | 0.957 (-2.1) | 0.971 (+1.4) | 0.941 (+4.4) | 0.971 (+4.4) |
| 188D + PSSM composition + k-separated-bigrams-PSSM | 0.985 (+6.7) | 0.949 (-2.4) | 0.967 (+1) | 0.935 (+4) | 0.968 (+4.2) |
Results of various features using SGD with Borderline-SMOTE on 10-fold cross-validation. Values in parentheses give the change, in percentage points, relative to the same feature without Borderline-SMOTE.
| Feature | SN | SP | ACC | MCC | F-measure |
|---|---|---|---|---|---|
| 188D | 0.966 (+10) | 0.909 (-4.2) | 0.938 (+1.2) | 0.877 (+5.6) | 0.939 (+6.6) |
| PSSM composition | 0.975 (+10.2) | 0.938 (-1.8) | 0.957 (+2.6) | 0.914 (+8) | 0.958 (+7.5) |
| k-separated-bigrams-PSSM | 0.997 (+3.3) | 0.942 (-1) | 0.970 (+1.4) | 0.941 (+4.4) | 0.971 (+4.3) |
| 188D + PSSM composition | 0.985 (+8.3) | 0.931 (-2.8) | 0.958 (+1.6) | 0.918 (+5.7) | 0.959 (+5.7) |
| 188D + k-separated-bigrams-PSSM | 0.985 (+7.3) | 0.931 (-2.1) | 0.958 (+1.8) | 0.918 (+6.1) | 0.959 (+5.9) |
| PSSM composition + k-separated-bigrams-PSSM | 0.984 (+7.9) | 0.951 (-1.6) | 0.967 (+1.8) | 0.945 (+6.8) | 0.968 (+5.5) |
| 188D + PSSM composition + k-separated-bigrams-PSSM | 0.988 (+7.6) | 0.940 (-2) | 0.964 (+1.8) | 0.928 (+5.8) | 0.965 (+5.6) |
FIGURE 2. Results of the model with or without Borderline-SMOTE on 10-fold cross-validation.
FIGURE 3. ROC curve with or without Borderline-SMOTE. (A) ROC curve without Borderline-SMOTE. (B) ROC curve with Borderline-SMOTE.
FIGURE 4. Results of the model with or without Borderline-SMOTE on the test dataset.
Results of various classifiers using the k-separated-bigrams-PSSM feature without Borderline-SMOTE on 10-fold cross-validation.
| Classifier | SN | SP | ACC | MCC | F-measure |
|---|---|---|---|---|---|
| SVM | 0.948 | 0.944 | 0.945 | 0.872 | 0.911 |
| NB | | 0.782 | 0.842 | 0.703 | 0.786 |
| SGD | 0.964 | 0.952 | 0.956 | 0.897 | 0.928 |
| RF | 0.925 | **0.973** | **0.958** | **0.900** | **0.929** |
Bold values in the table indicate the best results.
Results of various classifiers using the k-separated-bigrams-PSSM feature with Borderline-SMOTE on 10-fold cross-validation.
| Classifier | SN | SP | ACC | MCC | F-measure |
|---|---|---|---|---|---|
| SVM | | 0.877 | 0.937 | 0.880 | 0.940 |
| NB | 0.989 | 0.774 | 0.881 | 0.781 | 0.893 |
| SGD | 0.997 | 0.942 | 0.970 | 0.941 | 0.971 |
| RF | 0.986 | **0.970** | **0.978** | **0.956** | **0.978** |
Bold values in the table indicate the best results.
Experimental results of different methods.
| Model | ACC | MCC | SN | SP |
|---|---|---|---|---|
| ISTRF | | | | 0.973 |
| BioSeq-SVM | 0.9457 | 0.8694 | 0.9079 | 0.9615 |
| BioSeq-RF | 0.938 | 0.8505 | 0.8026 | |
Bold values in the table indicate the best results.
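The indicators used throughout these tables (SN, SP, ACC, MCC, F-measure) can all be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name is hypothetical. The example counts (TP = 69, FN = 7, TN = 175, FP = 7) are not reported in the source. They are inferred here as the counts on a 76 SUT / 182 non-SUT test set that are consistent with the BioSeq-SVM row above.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute SN, SP, ACC, MCC and F-measure from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * sn / (precision + sn)
                 if (precision + sn) else 0.0)
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc, "F": f_measure}


# Inferred (not reported) counts matching the BioSeq-SVM row.
metrics = classification_metrics(tp=69, fn=7, tn=175, fp=7)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

With these counts the rounded ACC, MCC, SN, and SP come out as 0.9457, 0.8694, 0.9079, and 0.9615, matching the BioSeq-SVM row.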