Yue Gong, Benzhi Dong, Zixiao Zhang, Yixiao Zhai, Bo Gao, Tianjiao Zhang, Jingyu Zhang.
Abstract
Vesicular transport proteins are implicated in many human diseases, and pathological changes in them threaten human health. Protein function prediction is one of the most intensively studied topics in bioinformatics. In this work, we developed VTP-Identifier, a tool to identify vesicular transport proteins. Our strategy is to extract transition probability composition, autocovariance transformation and other information from the position-specific scoring matrix (PSSM) as feature vectors. EditedNearestNeighbours (ENN) is used to address the imbalance of the data set, and the Max-Relevance-Max-Distance (MRMD) algorithm is adopted to reduce the dimension of the feature vector. We used 5-fold cross-validation and independent test sets to evaluate our model. On the test set, VTP-Identifier outperformed GRU: the accuracy, Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) were 83.6%, 0.531 and 0.873, respectively.
Keywords: XGBoost; machine learning; position-specific scoring matrix; protein function prediction; vesicular transport proteins
Year: 2022 PMID: 35047020 PMCID: PMC8762342 DOI: 10.3389/fgene.2021.808856
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Training flow chart of the prediction model of vesicular transport proteins.
Statistics of the dataset in this work.
| | Total | Train | Test |
|---|---|---|---|
| Vesicular | 2,533 | 2,214 | 319 |
| Non-vesicular | 9,086 | 7,573 | 1,513 |
FIGURE 2. The values of the different unbalanced data processing methods on the training set.
Evaluation of model performance after processing unbalanced data by ENN.
| | Acc | Sens | Spec | Precision | MCC | AUC |
|---|---|---|---|---|---|---|
| ENN | 0.85 | 0.701 | 0.919 | 0.811 | 0.659 | 0.908 |
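
The editing rule behind ENN can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: it assumes a single pass, Euclidean distance, `k = 3`, and a majority-vote rule applied only to the majority class. The function name `edited_nearest_neighbours` and the toy data are hypothetical.

```python
import math


def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Return the indices kept after one ENN-style pass.

    Assumption for this sketch: only majority-class samples can be removed,
    and a sample is removed when fewer than half of its k nearest
    neighbours share its label.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority samples are never removed here
            continue
        # k nearest neighbours of sample i, excluding the sample itself
        neighbours = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        agree = sum(1 for _, j in neighbours if y[j] == yi)
        if agree * 2 > len(neighbours):
            keep.append(i)
    return keep


# Toy data (hypothetical): class 0 is the majority; the last point is a
# class-0 sample sitting inside the class-1 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (5.2, 5.2)]
y = [0, 0, 0, 0, 1, 1, 1, 0]
kept = edited_nearest_neighbours(X, y, majority_label=0)
```

On this toy data the class-0 point embedded in the class-1 cluster is the only sample dropped, which is the behaviour that thins the majority class near the decision boundary.
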
FIGURE 3. (A) Comparison of single feature extraction methods. (B) Comparison of combining feature extraction methods.
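
As an illustration of one PSSM-derived feature type, the autocovariance transformation can be sketched as follows. The sketch assumes the commonly used definition, in which each feature is the average product of mean-centred scores in the same PSSM column at positions `lag` residues apart. The exact variant and lag range used in this work are not stated here, so the function name `pssm_autocovariance`, the `max_lag` parameter, and the toy matrix are all assumptions.

```python
def pssm_autocovariance(pssm, max_lag):
    """Autocovariance features from a PSSM given as a list of rows.

    Each row holds the scores for one residue position (20 columns in a
    standard PSSM). Returns max_lag * n_columns features, ordered by lag
    and then by column.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Mean score of each column over the whole sequence
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4-residue, 2-column matrix (hypothetical values, not a real PSSM)
toy_pssm = [[1, 0], [2, 0], [3, 0], [4, 0]]
features = pssm_autocovariance(toy_pssm, max_lag=2)
```

Because the output length depends only on `max_lag` and the number of columns, proteins of different lengths map to feature vectors of the same size, which is what makes this kind of transformation usable as classifier input.
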
The results of using different sorting methods in MRMD on the training set.
| | Dimension | Acc | Sens | Spec | Precision | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Hits_a | 681 | 0.852 | 0.711 | 0.919 | 0.805 | 0.653 | 0.907 |
| TrustRank | 992 | 0.857 | 0.709 | 0.927 | 0.818 | 0.658 | 0.907 |
| PageRank | 898 | 0.855 | 0.712 | 0.922 | 0.81 | 0.658 | 0.907 |
| LeaderRank | 738 | 0.854 | 0.712 | 0.921 | 0.809 | 0.656 | 0.908 |
| Hits_h | 791 | 0.855 | 0.713 | 0.921 | 0.813 | 0.66 | 0.908 |
Comparison of four classifiers on six performance metrics on the training set.
| | Acc | Sens | Spec | Precision | MCC | AUC |
|---|---|---|---|---|---|---|
| RF | 0.823 | 0.582 | 0.936 | 0.81 | 0.574 | 0.886 |
| SVM | 0.843 | 0.72 | 0.9 | 0.773 | 0.633 | 0.896 |
| KNN | 0.822 | 0.732 | 0.865 | 0.72 | 0.595 | 0.879 |
| XGBoost | 0.855 | 0.713 | 0.921 | 0.813 | 0.66 | 0.908 |
FIGURE 4. ROC curve of vesicle transporters identified by different methods.
Performance comparison between our model and GRU.
| | Acc | Sens | Spec | Precision | MCC | AUC |
|---|---|---|---|---|---|---|
| GRU | 0.809 | 0.708 | 0.829 | 0.515 | 0.459 | 0.850 |
| VTP-Identifier | 0.836 | 0.757 | 0.852 | 0.517 | 0.531 | 0.873 |
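
The threshold-based metrics in these tables (Acc, Sens, Spec, Precision, MCC) all follow from the four confusion-matrix counts. The sketch below uses the standard definitions. The function name `binary_metrics` and the counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": ratio(tp + tn, tp + fp + tn + fn),
        "sens": ratio(tp, tp + fn),        # recall on the positive class
        "spec": ratio(tn, tn + fp),        # recall on the negative class
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical imbalanced test set: 100 positives, 500 negatives
metrics = binary_metrics(tp=75, fp=75, tn=425, fn=25)
```

With these hypothetical counts, specificity is 0.85 while precision is only 0.5: when negatives outnumber positives five to one, even a 15% false-positive rate yields as many false positives as true positives. The same arithmetic is why precision can sit well below sensitivity and specificity on an imbalanced test set.
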
FIGURE 5. Comparison of PR curves between our model and GRU.