Qian Wang, Jun Ye, Teng Xu, Ning Zhou, Zhongqiu Lu, Jianchao Ying.
Abstract
Identification of prokaryotic transposases (Tnps) gives insight not only into the spread of antibiotic resistance and virulence but also into the process of DNA movement. This study aimed to develop a classifier for predicting Tnps in bacteria and archaea using machine learning (ML) approaches. We extracted a total of 2751 protein features from a training dataset comprising 14852 Tnps and 14852 controls, and selected 75 features as predictive signatures using combined mutual information (MI) and least absolute shrinkage and selection operator (LASSO) algorithms. By aggregating these signatures, an ensemble classifier that integrated a collection of individual ML-based classifiers was developed to identify Tnps. Further validation revealed that this classifier achieved good performance, with an average AUC of 0.955, and matched or exceeded other common methods. Based on this ensemble classifier, a stand-alone command-line tool designated TnpDiscovery was established to maximize convenience for bioinformaticians and experimental researchers in Tnp prediction. This study demonstrates the effectiveness of ML approaches in identifying Tnps, facilitating the discovery of novel Tnps in the future.
Keywords: feature selection; machine learning; prokaryotic transposase; protein classifier; protein feature
Year: 2021 PMID: 34309504 PMCID: PMC8477400 DOI: 10.1099/mgen.0.000611
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
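The mutual-information filter described in the abstract scores each of the 2751 features by how much knowing its (discretized) value reduces uncertainty about the Tnp/control label, I(X;Y) = Σ p(x,y) log2[p(x,y)/(p(x)p(y))]. A minimal pure-Python sketch of that score; the paper's exact pipeline and discretization are not given here, so the toy feature and labels below are illustrative only:

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """I(X;Y) in bits between a discrete feature and class labels."""
    n = len(feature)
    pxy = Counter(zip(feature, labels))  # joint counts
    px = Counter(feature)                # marginal counts of the feature
    py = Counter(labels)                 # marginal counts of the label
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts folded in
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# toy data: a discretized feature that mostly tracks the Tnp/control label
labels  = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = Tnp, 0 = control
feature = [1, 1, 1, 0, 0, 0, 0, 1]
print(round(mutual_information(feature, labels), 3))  # → 0.189
```

Features scoring near zero bits carry almost no information about the label and are dropped before the LASSO step refines the remaining set.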
Fig. 1. Predictive potential of the 2751 protein features in the classification of Tnps. (a) ROC curves for the top ten features with the best predictive performance in the training dataset. (b) Embedding of 2751 features from 18 descriptors using t-SNE. The red and blue dots represent Tnps and controls, respectively. (c) Statistics of the features selected by both MI and LASSO methods. The number at the top of each bar represents the number of selected features in that protein descriptor. (d) Embedding of 75 features from 15 descriptors using t-SNE. The red and blue dots represent Tnps and controls, respectively. (e) Unsupervised hierarchical clustering and heatmap of the training dataset based on the 75 selected features.
The 75 feature signatures selected in this study
| Descriptor | Dimension | Selection | Selected features |
|---|---|---|---|
| AAC | 20 | 3 | H, R, V |
| APAAC | 80 | 0 | – |
| CKSAAGP | 150 | 18 | uncharger.postivecharger.gap5, postivecharger.aromatic.gap0, uncharger.postivecharger.gap2, postivecharger.uncharger.gap0, postivecharger.uncharger.gap2, aromatic.postivecharger.gap4, alphaticr.negativecharger.gap3, alphaticr.alphaticr.gap3, postivecharger.uncharger.gap4, negativecharger.alphaticr.gap4, alphaticr.negativecharger.gap5, postivecharger.postivecharger.gap1, postivecharger.aromatic.gap1, uncharger.postivecharger.gap1, aromatic.postivecharger.gap0, alphaticr.alphaticr.gap2, uncharger.postivecharger.gap3, postivecharger.uncharger.gap3 |
| CTDC | 39 | 2 | solventaccess.G3, polarizability.G2 |
| CTDD | 195 | 7 | charge.2.residue100, charge.3.residue25, charge.2.residue75, hydrophobicity_FASG890101.1.residue75, polarity.3.residue100, charge.3.residue75, polarity.3.residue75 |
| CTDT | 39 | 2 | charge.Tr1221, hydrophobicity_ENGD860101.Tr1221 |
| CTriad | 343 | 10 | g3.g3.g4, g5.g4.g3, g5.g3.g5, g3.g5.g3, g5.g5.g3, g2.g5.g5, g3.g5.g4, g4.g5.g5, g2.g5.g3, g5.g5.g4 |
| DDE | 400 | 19 | DR, RL, KR, RC, HL, HR, DI, VI, GE, RQ, RT, RK, RS, CL, RR, YS, PF, GD, RW |
| DPC | 400 | 1 | YN |
| GAAC | 5 | 1 | postivecharge |
| GDPC | 25 | 3 | postivecharger.aromatic, aromatic.postivecharger, postivecharger.uncharger |
| Geary | 240 | 1 | CHAM810101.lag3 |
| GTPC | 125 | 2 | negativecharger.negativecharger.alphaticr, aromatic.negativecharger.alphaticr |
| Moran | 240 | 1 | CIDH920105.lag4 |
| NMBroto | 240 | 0 | – |
| PAAC | 50 | 3 | Xc1.I, Xc1.V, Xc1.F |
| QSOrder | 60 | 2 | Grantham.Xr.C, Grantham.Xr.W |
| SOCNumber | 100 | 0 | – |
| Total | 2751 | 75 | – |
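The simplest descriptor in the table, AAC, is just the fraction of each of the 20 standard amino acids in a protein sequence (a 20-dimensional vector); the selected signatures H, R and V are three of its components. A minimal sketch, with a short made-up peptide for illustration rather than a sequence from the dataset:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def aac(sequence):
    """Amino acid composition: fraction of each residue in the sequence."""
    counts = Counter(sequence.upper())
    n = len(sequence)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}

# toy peptide; the three AAC features selected in the table are H, R and V
comp = aac("MHRVHRRG")
print({aa: comp[aa] for aa in "HRV"})  # → {'H': 0.25, 'R': 0.375, 'V': 0.125}
```

The other descriptors in the table (CKSAAGP, CTD, CTriad, autocorrelation, etc.) build on the same idea, counting residue groups, residue pairs at fixed gaps, or physicochemical-property transitions instead of single residues.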
Fig. 2. Classifier construction for predicting Tnps using the training dataset. (a) Classification metrics for evaluating the performance of DL, GBM, and XGB algorithms based on ten-fold CV. (b) Classification metrics for evaluating the performance of the best-performing classifier (GBM-best) and two ensemble classifiers based on ten-fold CV. The red star indicates the best performance amongst these algorithms or classifiers. ROC curves (c) and confusion matrix plots (d) of GBM-best and two ensemble classifiers based on the entire training dataset. On the confusion matrix plot, the rows correspond to the predicted class and the columns correspond to the true class. The green cells correspond to correctly classified observations; the red cells correspond to incorrectly classified observations. Both the number of observations and the percentage of the total number of observations are shown in each cell. The column on the far right of the plot shows the percentages of all samples predicted to belong to each class that are correctly (precision) and incorrectly (false discovery rate) classified. The row at the bottom of the plot shows the percentages of all samples belonging to each class that are correctly (recall) and incorrectly (false negative rate) classified. The cell at the bottom right of the plot shows the overall accuracy.
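The per-class quantities named in the Fig. 2 caption all follow from the four confusion-matrix counts. A minimal sketch; the counts passed in at the bottom are hypothetical, not the paper's results:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Metrics as laid out on the Fig. 2 confusion matrix plot."""
    precision = tp / (tp + fp)                    # right column, green entry
    fdr       = fp / (tp + fp)                    # right column, red entry
    recall    = tp / (tp + fn)                    # bottom row, green entry
    fnr       = fn / (tp + fn)                    # bottom row, red entry
    accuracy  = (tp + tn) / (tp + fp + fn + tn)   # bottom-right cell
    return {"precision": precision, "FDR": fdr,
            "recall": recall, "FNR": fnr, "accuracy": accuracy}

# hypothetical counts for illustration only
print(confusion_metrics(tp=90, fp=10, fn=20, tn=80))
```

Note that precision + FDR = 1 and recall + FNR = 1, which is why the green and red entries in each margin of the plot sum to 100 %.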
Fig. 3. Classifier evaluation using the validation datasets. Classification metrics were used to evaluate the performance of GBM-best, EC-all, and EC-best in ten randomly produced validation datasets. The red star indicates the best performance amongst these classifiers.
Fig. 4. Performance comparison between EC-all and two existing approaches for Tnp prediction using the testing dataset. Shown are ROC curves (a) and significance matrix plots (b) for these three methods and their corresponding combined methods. The entry values of the significance matrix represent the p-values for the comparison of the AUC of two ROC curves.
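The AUC compared across these ROC curves has a direct probabilistic reading: it is the probability that a randomly chosen Tnp receives a higher classifier score than a randomly chosen control, with ties counted as half. A minimal sketch of that rank-based computation; the scores below are made up for illustration:

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# made-up classifier scores: Tnps (positives) vs. controls (negatives)
pos = [0.9, 0.8, 0.7, 0.6]
neg = [0.5, 0.65, 0.3, 0.2]
print(auc(pos, neg))  # → 0.9375
```

Comparing two AUCs on the same testing dataset, as in panel (b), additionally requires a significance test that accounts for the correlation between the paired ROC curves; the specific test used is not stated in this excerpt.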