Trinh-Trung-Duong Nguyen1, Nguyen-Quoc-Khanh Le2,3, Quang-Thai Ho1, Dinh-Van Phan4, Yu-Yen Ou5.
Abstract
BACKGROUND: Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes, including immunity regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors are associated with the regulation of various biological processes such as cell proliferation and differentiation, apoptosis, lipid metabolism, and coagulation. These cytokines are also implicated in various diseases such as insulin resistance, autoimmune diseases, and cancer. Given the interdependence between this family of cytokines and others, distinguishing tumor necrosis factors from other cytokines is a challenge for biological scientists.
Keywords: Binary classification; Feature extraction; Machine learning; Natural language processing
Year: 2020 PMID: 33087125 PMCID: PMC7579990 DOI: 10.1186/s12920-020-00779-w
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Number of features input in our binary classifiers
| Feature types | Number of features |
|---|---|
| 1-g | 20 |
| 2-g | 395 ➔ 398 |
| 3-g | 1736 ➔ 1915 |
| 4-g | 60 ➔ 83 |
| 5-g | 6 ➔ 11 |
| 1-g and 2-g combined | 415 ➔ 418 |
| 1-g and 3-g combined | 1756 ➔ 1935 |
| 2-g and 3-g combined | 2131 ➔ 2313 |
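The feature counts above follow from treating a protein sequence as overlapping n-gram "biological words": with 1-g there are at most 20 distinct words (the amino acids), while longer n-grams yield larger vocabularies. A minimal sketch of this extraction, with an illustrative toy sequence (the function name and example are not from the paper):

```python
def biological_words(sequence, n):
    """Return the overlapping n-grams (length-n substrings) of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

seq = "MKVLGTISAA"  # toy protein fragment, for illustration only
print(biological_words(seq, 3))  # 8 overlapping 3-grams

# The feature count for each n is the number of distinct n-grams
# observed across all sequences in the dataset:
vocab = set(biological_words(seq, 1))
print(len(vocab))  # 9 distinct residues in this toy fragment; at most 20 with real data
```

With real sequence data, collecting `set()` unions over the whole corpus gives vocabulary sizes on the scale reported in the table (e.g. 20 for 1-g, a few hundred for 2-g).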
Fig. 1 Amino acid composition of the surveyed TNF and non-TNF proteins. The amino acids K, G, L, and I showed the most significant frequency differences
Fig. 2 The variance of amino acid composition of the surveyed TNF and non-TNF proteins
AUC performance of SVM classifier on embedding features with different biological word lengths
| Feature | 5-fold cross-validation AUC | Independent AUC |
|---|---|---|
| 1-g | 0.856 ± 0.485 | 0.848 ± 0.525 |
| 2-g | 0.901 ± 0.416 | 0.883 ± 0.599 |
| 3-g | 0.934 ± 0.423 | 1 ± 0 |
| 4-g | 0.617 ± 0.63 | 0.563 ± 0.751 |
| 5-g | 0.543 ± 0.539 | 0.574 ± 0.686 |
| 1-g and 2-g combined | 0.952 ± 0.416 | 0.934 ± 0.48 |
| 1-g and 3-g combined | 0.96 ± 0.42 | 0.921 ± 0.497 |
| 2-g and 3-g combined | | |
(Each result is reported in format: m ± d, where m is the mean and d is the standard deviation across the ten runs)
Performance comparison of the proposed features with AAC, DPC, PSSM, and their combinations; the highest value in each column is highlighted in bold
Cross-validation data

| Feature types | Acc (%) | Spec (%) | Sen (%) | MCC |
|---|---|---|---|---|
| AAC | 60.69 ± 12.13 | 59.61 ± 15.82 | 68.01 ± 19.77 | 0.24 ± 0.09 |
| DPC | 70.42 ± 13.97 | 72.89 ± 17.91 | 51.67 ± 20.14 | 0.24 ± 0.13 |
| AAC-DPC | 86.48 ± 5.82 | 88.83 ± 8.04 | 69.34 ± 16.62 | 0.53 ± 0.08 |
| PSSM | 89.57 ± 6.68 | 91.88 ± 5.73 | 73.34 ± 18.33 | 0.61 ± 0.16 |
| PSSM-AAC | 91.17 ± 3.18 | 93.19 ± 4.11 | 76.34 ± 12.91 | 0.66 ± 0.1 |
| PSSM-DPC | 91.25 ± 3.29 | 93.85 ± 4.58 | 72.34 ± 12.58 | 0.64 ± 0.09 |
| PSSM-DPC-AAC | 91.25 ± 2.84 | 93.67 ± 3.85 | 73.67 ± 14.45 | 0.64 ± 0.1 |
| Proposed features | | | | |

Independent data

| Feature types | Acc (%) | Spec (%) | Sen (%) | MCC |
|---|---|---|---|---|
| AAC | 46.92 ± 25.59 | 42.37 ± 33.15 | 81.25 ± 34.49 | 0.20 ± 0.07 |
| DPC | 82.05 ± 23.71 | 84.63 ± 29.06 | 62.75 ± 39.92 | 0.46 ± 0.28 |
| AAC-DPC | 93.65 ± 2.66 | 94.95 ± 3.26 | 83.75 ± 9.59 | 0.73 ± 0.09 |
| PSSM | 94.85 ± 2.82 | 97.56 ± 2.85 | 74.5 ± 28.35 | 0.72 ± 0.26 |
| PSSM-AAC | 95.77 ± 1.45 | 97.42 ± 2.14 | 83.25 ± 11.31 | 0.81 ± 0.06 |
| PSSM-DPC | 95.94 ± 1.33 | 97.81 ± 1.49 | 82 ± 8.8 | 0.81 ± 0.07 |
| PSSM-DPC-AAC | 95.14 ± 2.02 | 96.77 ± 2.48 | 83 ± 9.7 | 0.78 ± 0.08 |
| Proposed features | | | | |
(Each result is reported in format: m ± d, where m is the mean and d is the standard deviation across the ten runs)
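The baseline features above are standard sequence descriptors: AAC (amino acid composition) is the 20-dimensional vector of residue frequencies, and DPC (dipeptide composition) is the 400-dimensional vector of overlapping residue-pair frequencies. A self-contained sketch of both, using the standard definitions (the toy sequence is illustrative):

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(seq):
    """Amino acid composition: 20 relative residue frequencies."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide composition: 400 relative frequencies of overlapping pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]

# AAC-DPC combined gives a 420-dimensional feature vector per sequence:
features = aac("MKVLG") + dpc("MKVLG")
print(len(features))  # 420
```

PSSM features additionally require a position-specific scoring matrix from a PSI-BLAST search, so they are not reproducible from the sequence alone and are omitted here.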
Statistics of the surveyed TNF and non-TNF sequences
| | Original | After 20% similarity check | Cross-validation | Independent |
|---|---|---|---|---|
| TNF | 106 | 18 | 14 | 4 |
| Non-TNF | 1023 | 133 | 103 | 30 |
Performance comparison of five commonly used binary classifiers on proposed features
Cross-validation data

| Classifier | Acc (%) | Spec (%) | Sen (%) | MCC |
|---|---|---|---|---|
| SVM | | | | |
| kNN | 77.33 ± 3.7 | 75.41 ± 3.98 | 100 ± 0 | 0.47 ± 0.03 |
| RandomForest | 94.22 ± 2.3 | 94.20 ± 2.9 | 94 ± 8.43 | 0.75 ± 0.05 |
| Naïve Bayes | 21.59 ± 10.62 | 14.76 ± 11.45 | 100 ± 0 | 0.09 ± 0.06 |
| QuickRBF | 94.80 ± 1.52 | 99.81 ± 0.4 | 57.99 ± 14.25 | 0.72 ± 0.09 |

Independent data

| Classifier | Acc (%) | Spec (%) | Sen (%) | MCC |
|---|---|---|---|---|
| SVM | | | | |
| kNN | 79.39 ± 8.9 | 78.01 ± 10.57 | 93.34 ± 14.04 | 0.47 ± 0.09 |
| RandomForest | 97.28 ± 2.25 | 99 ± 2.26 | 80.01 ± 23.31 | 0.84 ± 0.14 |
| Naïve Bayes | 19.09 ± 23.76 | 10.99 ± 26.15 | 100 ± 0 | 0.08 ± 0.17 |
| QuickRBF | 94.12 ± 1.97 | 100 ± 0 | 50 ± 16.7 | 0.68 ± 0.13 |
(Each result is reported in format: m ± d, where m is the mean and d is the standard deviation across the ten runs)
Fig. 3 The flowchart of this study. First, the surveyed dataset was used to train a FastText model, and the trained model was then used to generate word embedding vectors. Next, word embedding-based feature vectors were created for each sequence. Finally, a support vector machine classifier was used for classification
Fig. 4 The four-step flowchart demonstrating our method for using word embedding vectors as protein features. In this illustration, two sequences were used and the segmentation size was set to 3
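A minimal sketch of the final step in Fig. 4: segmenting a sequence into n-gram words and averaging their embedding vectors to obtain one fixed-length feature vector per protein. The `toy_embedding` lookup table stands in for the vectors a trained FastText model would supply; its values are made up for illustration:

```python
def sequence_vector(sequence, embedding, n=3):
    """Average the embedding vectors of a sequence's overlapping n-grams."""
    words = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    dim = len(next(iter(embedding.values())))
    vec = [0.0] * dim
    for w in words:
        # Words missing from the embedding table contribute a zero vector.
        for j, x in enumerate(embedding.get(w, [0.0] * dim)):
            vec[j] += x
    return [x / len(words) for x in vec]

# Two-dimensional toy vectors; a real FastText model would give e.g. 100-d.
toy_embedding = {"MKV": [1.0, 0.0], "KVL": [0.0, 1.0]}
print(sequence_vector("MKVL", toy_embedding, n=3))  # → [0.5, 0.5]
```

The resulting fixed-length vectors are what the SVM classifier in Fig. 3 is trained on; averaging is one common pooling choice, assumed here for concreteness.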