Ruiyan Hou, Lida Wang, Yi-Jun Wu.
Abstract
ATP-binding cassette (ABC) proteins play important roles in a wide variety of species. These proteins are involved in absorbing nutrients, exporting toxic substances, and regulating potassium channels, and they contribute to drug resistance in cancer cells. Therefore, identifying ABC transporters is an urgent task. The present study used 188D as the feature extraction method, which is based on sequence information and physicochemical properties. We also visualized the extracted features with t-Distributed Stochastic Neighbor Embedding (t-SNE); the visualization suggests that the samples can be separated on the basis of the 188D features. Further, random forest (RF) proved to be an efficient classifier for identifying these proteins. Under 10-fold cross-validation of the proposed model on the training set, the average accuracy over the 10 training sets was 89.54%, with a specificity of 0.87, a sensitivity of 0.92, and a Matthews correlation coefficient (MCC) of 0.79. On the testing set, the accuracy was 89%. These results suggest that the model combining 188D with RF is an effective tool for identifying ABC transporters.
Keywords: 188D; ABC transporters; classify; random forest; t-SNE
Year: 2020 PMID: 32269586 PMCID: PMC7109328 DOI: 10.3389/fgene.2020.00156
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1 The structure of ATP-binding cassette (ABC) transporters.
FIGURE 2 Overall process of this study.
Three classes of amino acids, divided according to each physicochemical property.
| Physicochemical property | Class 1 | Class 2 | Class 3 |
| --- | --- | --- | --- |
| Hydrophobicity | RKEDQN | GASTPHY | CVLIMFW |
| Normalized van der Waals volume | GASCTPD | NVEQIL | MHKFRYW |
| Polarity | LIFWCMVY | PATGS | HQRKNED |
| Polarizability | GASDT | CPNVEQIL | KMHFRYW |
| Charge | KR | ANCQGHILMFPSTWYV | DE |
| Surface tension | GQDNAHR | KTSEC | ILMFPWYV |
| Secondary structure | EALMQKRH | VIYCWFT | GNPSD |
| Solvent accessibility | ALFCGIVW | RKQEND | MPSTHY |
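
The table above can be turned into a feature vector in a few lines. The sketch below is a minimal illustration, not the authors' implementation. It assumes the commonly described 188D layout: 20 amino acid frequencies, plus 21 values for each of the 8 properties (3 composition, 3 transition, and 15 distribution descriptors). The function names `ctd_descriptors` and `features_188d`, the percentile convention used for the distribution descriptors, and the example sequence are all assumptions made for illustration.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Three-class partitions copied from the table above.
PROPERTY_CLASSES: Dict[str, List[str]] = {
    "hydrophobicity": ["RKEDQN", "GASTPHY", "CVLIMFW"],
    "van_der_waals_volume": ["GASCTPD", "NVEQIL", "MHKFRYW"],
    "polarity": ["LIFWCMVY", "PATGS", "HQRKNED"],
    "polarizability": ["GASDT", "CPNVEQIL", "KMHFRYW"],
    "charge": ["KR", "ANCQGHILMFPSTWYV", "DE"],
    "surface_tension": ["GQDNAHR", "KTSEC", "ILMFPWYV"],
    "secondary_structure": ["EALMQKRH", "VIYCWFT", "GNPSD"],
    "solvent_accessibility": ["ALFCGIVW", "RKQEND", "MPSTHY"],
}


def ctd_descriptors(seq: str, classes: List[str]) -> List[float]:
    """Return 21 values for one property: 3 composition, 3 transition, 15 distribution."""
    lookup = {aa: idx for idx, group in enumerate(classes) for aa in group}
    # Assumption: residues outside the 20 standard amino acids are skipped.
    encoded = [lookup[aa] for aa in seq if aa in lookup]
    n = len(encoded)
    if n == 0:
        return [0.0] * 21

    # Composition: fraction of residues falling in each class.
    composition = [encoded.count(c) / n for c in range(3)]

    # Transition: fraction of adjacent pairs that switch between two given classes.
    pairs = list(zip(encoded, encoded[1:]))
    n_pairs = max(len(pairs), 1)
    transition = [
        sum(1 for a, b in pairs if {a, b} == {x, y}) / n_pairs
        for x, y in ((0, 1), (0, 2), (1, 2))
    ]

    # Distribution: relative position of the first, 25%, 50%, 75%, and last
    # occurrence of each class (assumed percentile convention).
    distribution: List[float] = []
    for c in range(3):
        positions = [i + 1 for i, v in enumerate(encoded) if v == c]
        if not positions:
            distribution.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(math.ceil(frac * len(positions)) - 1, 0)
            distribution.append(positions[k] / n)

    return composition + transition + distribution


def features_188d(seq: str) -> List[float]:
    """Concatenate 20 amino acid frequencies with 8 x 21 property descriptors."""
    seq = seq.upper()
    standard = [aa for aa in seq if aa in AMINO_ACIDS]
    n = len(standard)
    vector = [standard.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]
    for classes in PROPERTY_CLASSES.values():
        vector.extend(ctd_descriptors(seq, classes))
    return vector


example = "MKRDEAGLLV"  # hypothetical sequence, for illustration only
print(len(features_188d(example)))  # 20 + 8 * 21 = 188
```

Each row of the table partitions the 20 standard amino acids exactly once, so every residue maps to one class per property and the three composition values for a property sum to 1.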
FIGURE 3 Flowchart of the 188D feature extraction method.
FIGURE 4 Display of the training features by t-SNE. S is the abbreviation for sample.
FIGURE 5 Performance of the different classifiers. (A) Accuracy comparison on the training sets for all classification methods. (B) Sensitivity comparison on the training sets for all classification methods. (C) Specificity comparison on the training sets for all classification methods. (D) Matthews correlation coefficient comparison on the training sets for all classification methods. (E) Accuracy comparison on the testing sets for all classification methods.
FIGURE 6 ROC curves comparing the different classifiers. S is the abbreviation for sample.
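
The evaluation metrics reported in the abstract and compared in Figure 5 (accuracy, sensitivity, specificity, and MCC) can all be derived from a binary confusion matrix. The sketch below shows one way to compute them. The function name `binary_metrics` and the counts in the example are hypothetical: the source does not report a confusion matrix, so the counts assume a balanced set of 100 positives and 100 negatives, chosen only to give a sensitivity of 0.92 and a specificity of 0.87.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (not taken from the paper): 100 positives, 100 negatives.
metrics = binary_metrics(tp=92, fn=8, tn=87, fp=13)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, the accuracy and MCC round to 0.895 and 0.79, in line with the training-set figures quoted in the abstract. This shows only that the reported values are mutually consistent under a balanced-class assumption, not what the actual class sizes were.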