Dilraj Kaur, Chakit Arora, Gajendra P S Raghava.
Abstract
This study describes a method developed for predicting pattern recognition receptors (PRRs), which are an integral part of the immune system. The models developed here were trained and evaluated on the largest possible non-redundant set of PRRs, obtained from PRRDB 2.0, and non-pattern recognition receptors (non-PRRs), obtained from Swiss-Prot. First, a similarity-based approach using BLAST was used to predict PRRs; it achieved only limited success due to a large number of no-hits. Second, machine learning-based models were developed using sequence composition and achieved a maximum MCC of 0.63. In addition, models developed using evolutionary information in the form of PSSM composition achieved a maximum MCC of 0.66. Finally, we developed hybrid models that combined the similarity-based BLAST approach with the machine learning-based models. Our best model, which combined BLAST with the PSSM-based model, achieved a maximum MCC of 0.82 with an AUROC of 0.95, exploiting the potential of both similarity search and machine learning techniques. To facilitate the scientific community, we also developed a web server "PRRpred" based on the best model developed in this study (http://webs.iiitd.edu.in/raghava/prrpred/).
Keywords: BLAST; innate immunity; machine learning; pattern recognition receptors; prediction; toll-like receptors
Year: 2020 PMID: 32082326 PMCID: PMC7002473 DOI: 10.3389/fimmu.2020.00071
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
Figure 1. Distribution of sequences in the negative and positive clusters obtained from CD-HIT. The x-axis represents the number of sequences per cluster and the y-axis the number of clusters containing that many sequences. Most positive and negative clusters contain few sequences, while a few clusters are comparatively large.
Figure 2. The flowchart shows how the positive clusters obtained from CD-HIT were partitioned into five subsets. The numbers in parentheses following the cluster names give the number of sequences in each cluster. As a result, Subset 1 contains the sequences of clusters 1, 6, 11, …, 106; Subset 2 those of clusters 2, 7, 12, …, 102; Subset 3 those of clusters 3, 8, 13, …, 103; Subset 4 those of clusters 4, 9, 14, …, 104; and Subset 5 those of clusters 5, 10, 15, …, 105.
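The round-robin assignment described in the figure (cluster 1 to subset 1, cluster 6 back to subset 1, and so on) can be sketched as follows; the function name and toy cluster contents are illustrative, not from the paper.

```python
def split_clusters_into_subsets(clusters, n_subsets=5):
    """Assign whole CD-HIT clusters to subsets round-robin, as in
    Figure 2: cluster 1 -> subset 1, ..., cluster 5 -> subset 5,
    cluster 6 -> subset 1, and so on. Keeping clusters intact ensures
    that similar sequences never land in different subsets."""
    subsets = [[] for _ in range(n_subsets)]
    for i, members in enumerate(clusters):
        subsets[i % n_subsets].extend(members)
    return subsets

# Toy example: ten clusters with made-up member IDs of varying size.
clusters = [[f"cluster{k}_seq{j}" for j in range(k % 3 + 1)] for k in range(1, 11)]
subsets = split_clusters_into_subsets(clusters)
```

Splitting at the cluster level (rather than the sequence level) is what prevents redundancy leaking between cross-validation folds.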
The performance of BLAST on the training and test datasets using five-fold cross-validation. PRRs and non-PRRs were searched at different BLAST e-value cutoffs; values are numbers of sequences, with percentages in parentheses.
| E-value | Training: correct hits (%) | Training: incorrect hits (%) | Test: correct hits (%) | Test: incorrect hits (%) |
| 10^-9 | 133 (74.30) | 4 (1.45) | 89 (32.48) | 3 (1.67) |
| 10^-8 | 134 (74.86) | 4 (1.45) | 90 (32.84) | 4 (2.23) |
| 10^-7 | 134 (74.86) | 5 (1.82) | 90 (32.84) | 4 (2.23) |
| 10^-6 | 135 (75.41) | 5 (1.82) | 93 (33.94) | 4 (2.23) |
| 10^-5 | 136 (75.97) | 7 (2.55) | 98 (35.76) | 5 (2.79) |
| 10^-4 | 136 (75.97) | 7 (2.55) | 99 (36.13) | 6 (3.35) |
| 10^-3 | 138 (77.09) | 8 (2.92) | 101 (36.86) | 6 (3.35) |
| 10^-2 | 139 (77.65) | 10 (3.64) | 102 (37.22) | 6 (3.35) |
| 10^-1 | 140 (78.21) | 20 (7.29) | 107 (39.05) | 7 (3.91) |
| 1 | 147 (82.12) | 65 (23.72) | 135 (49.27) | 18 (10.05) |
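A minimal sketch of the similarity-based step: given BLAST tabular output (`-outfmt 6`, whose eleventh column is the e-value), a query counts as a PRR hit when its best hit against the PRR database falls at or below the chosen e-value cutoff. The parsing function and the mock output below are illustrative, not taken from the paper's pipeline.

```python
def classify_by_blast(tabular_output, evalue_cutoff=1e-3):
    """Classify each query as a PRR hit if its best hit has an e-value
    at or below the cutoff. Expects BLAST -outfmt 6 lines, where
    column 11 (index 10) is the e-value."""
    best = {}
    for line in tabular_output.strip().splitlines():
        cols = line.split("\t")
        query, evalue = cols[0], float(cols[10])
        best[query] = min(best.get(query, float("inf")), evalue)
    return {q: e <= evalue_cutoff for q, e in best.items()}

# Mock outfmt-6 output: q1 has a strong hit, q2 only a weak one.
mock = ("q1\tprr7\t88.0\t120\t5\t0\t1\t120\t1\t120\t1e-40\t250\n"
        "q2\tprr3\t30.0\t50\t30\t2\t1\t50\t10\t60\t0.5\t28")
hits = classify_by_blast(mock)
```

Queries with no hit at all simply do not appear in the result, which is the "no-hits" problem the abstract mentions and the reason a hybrid fallback is needed.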
Figure 3. The percent amino acid composition of pattern recognition receptor (PRR) and non-pattern recognition receptor (non-PRR) proteins.
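The composition shown in Figure 3 is the standard 20-dimensional percent amino acid composition (AAC) feature used to train the models below; a minimal implementation:

```python
def amino_acid_composition(seq):
    """Percent composition over the 20 standard residues. The resulting
    20-dimensional vector sums to 100 and serves as the AAC feature."""
    residues = "ACDEFGHIKLMNPQRSTVWY"
    seq = seq.upper()
    total = sum(seq.count(r) for r in residues)  # non-standard residues ignored
    return [100.0 * seq.count(r) / total for r in residues]

aac = amino_acid_composition("MKTAYIAKQR")  # toy peptide, 10 residues
```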
The performance of models based on different machine learning techniques on the PRR dataset, developed using the AAC of protein sequences. The first block of metrics is from five-fold cross-validation on the training dataset and the second from the test dataset.
| Classifier | Parameters | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC |
| ET | ne = 90 | 80.71 | 82.56 | 81.73 | 0.90 | 0.63 | 77.06 | 84.08 | 82.46 | 0.88 | 0.63 |
| SVM | C = 5, g = 0.01, k = rbf | 78.07 | 83.83 | 81.62 | 0.87 | 0.62 | 77.95 | 82.31 | 81.06 | 0.88 | 0.60 |
| RF | ne = 100 | 77.82 | 81.46 | 80.08 | 0.88 | 0.59 | 77.42 | 80.85 | 79.97 | 0.87 | 0.58 |
| LR | C = 1 | 77.98 | 82.50 | 80.77 | 0.86 | 0.60 | 76.12 | 81.57 | 79.57 | 0.86 | 0.58 |
| MLP | a = tanh, HL = (19,), m = 200, s = adam | 77.02 | 82.77 | 80.50 | 0.86 | 0.59 | 78.88 | 77.94 | 78.90 | 0.87 | 0.57 |
| KNN | al = ball_tree, nn = 20, w = distance | 76.17 | 79.06 | 77.91 | 0.85 | 0.55 | 77.74 | 75.00 | 76.97 | 0.86 | 0.53 |
g, gamma; ne, n_estimators; k, kernel; a, activation; HL, hidden layer size; s, solver; al, algorithm; w, weight; m, max_iter; nn, n_neighbors.
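The metrics reported in these tables (sensitivity, specificity, and accuracy in percent, plus AUROC and the Matthews correlation coefficient) follow the standard confusion-matrix definitions; a self-contained sketch:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy (all %) and MCC from the
    confusion-matrix counts, matching the columns in the tables above."""
    sens = 100.0 * tp / (tp + fn)
    spec = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

sens, spec, acc, mcc = classification_metrics(80, 20, 80, 20)
```

Unlike accuracy, MCC stays informative on imbalanced datasets, which is why it is the headline metric here.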
The performance of models based on different machine learning techniques on the PRR dataset, developed using PSSM-400 features of protein sequences. The first block of metrics is from five-fold cross-validation on the training dataset and the second from the test dataset.
| Classifier | Parameters | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC |
| SVM | C = 10, g = 0.5, k = rbf | 77.80 | 85.89 | 82.78 | 0.87 | 0.64 | 79.74 | 85.46 | 83.64 | 0.89 | 0.66 |
| LR | C = 1,000 | 77.31 | 86.37 | 82.84 | 0.87 | 0.64 | 80.80 | 81.07 | 81.13 | 0.89 | 0.61 |
| KNN | al = ball_tree, nn = 6, w = distance | 72.80 | 83.48 | 79.36 | 0.86 | 0.57 | 78.40 | 82.50 | 81.07 | 0.87 | 0.60 |
| RF | ne = 80 | 75.95 | 85.01 | 81.55 | 0.87 | 0.61 | 79.07 | 81.41 | 80.74 | 0.86 | 0.60 |
| MLP | a = logistic, HL = ( | 75.26 | 85.09 | 81.28 | 0.86 | 0.61 | 79.07 | 81.03 | 80.26 | 0.88 | 0.59 |
| ET | ne = 70 | 80.33 | 78.79 | 79.36 | 0.88 | 0.58 | 83.73 | 74.97 | 79.15 | 0.87 | 0.59 |
g, gamma; ne, n_estimators; k, kernel; a, activation; HL, hidden layer size; s, solver; al, algorithm; w, weight; m, max_iter; nn, n_neighbors.
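PSSM-400 collapses a position-specific scoring matrix (one row of 20 substitution scores per residue position, typically produced by PSI-BLAST) into a fixed 400-dimensional vector, so that proteins of any length get features of the same size. The sketch below groups PSSM rows by residue type and averages them; the exact normalization used in the paper may differ, so treat this as an assumption.

```python
def pssm_400(sequence, pssm_rows):
    """Collapse an L x 20 PSSM into a 400-dimensional vector: for each of
    the 20 residue types, average the PSSM rows at positions where that
    residue occurs (zeros if the residue is absent from the sequence)."""
    residues = "ACDEFGHIKLMNPQRSTVWY"
    features = []
    for r in residues:
        rows = [row for aa, row in zip(sequence, pssm_rows) if aa == r]
        if rows:
            features.extend(sum(col) / len(rows) for col in zip(*rows))
        else:
            features.extend([0.0] * 20)
    return features

seq = "ACA"
rows = [[1.0] * 20, [2.0] * 20, [3.0] * 20]  # toy PSSM scores, one row per position
vec = pssm_400(seq, rows)
```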
The performance of models based on different machine learning techniques on the PRR dataset, developed using the combination of composition (AAC) and evolutionary information (PSSM-400) features of protein sequences. The first block of metrics is from five-fold cross-validation on the training dataset and the second from the test dataset.
| Classifier | Parameters | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC |
| MLP | a = tanh, HL = (70,), m = 200, s = adam | 77.70 | 86.54 | 83.20 | 0.88 | 0.65 | 81.23 | 85.50 | 84.19 | 0.90 | 0.67 |
| LR | C = 1,000 | 82.97 | 83.49 | 83.34 | 0.89 | 0.66 | 83.59 | 81.49 | 82.67 | 0.90 | 0.64 |
| RF | ne = 60 | 80.32 | 83.24 | 82.16 | 0.88 | 0.63 | 80.43 | 82.44 | 82.16 | 0.87 | 0.63 |
| ET | ne = 100 | 77.72 | 85.25 | 82.35 | 0.89 | 0.63 | 78.96 | 83.75 | 82.13 | 0.88 | 0.63 |
| SVC | C = 5, g = 0.01, k = rbf | 81.65 | 83.35 | 82.73 | 0.89 | 0.65 | 80.62 | 81.56 | 81.72 | 0.88 | 0.62 |
| KNN | al = ball_tree, nn = 20, w = distance | 80.20 | 76.60 | 78.12 | 0.87 | 0.56 | 80.41 | 72.88 | 76.35 | 0.86 | 0.52 |
g, gamma; ne, n_estimators; k, kernel; a, activation; HL, hidden layer size; s, solver; al, algorithm; w, weight; m, max_iter; nn, n_neighbors.
Figure 4. Receiver operating characteristic curves from five-fold cross-validation for the AAC, PSSM, and AAC+PSSM features using the support vector machine (SVM), logistic regression (LR), and multi-layer perceptron (MLP) classifiers, respectively.
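The AUROC values plotted in Figure 4 equal the probability that a randomly chosen positive is scored above a randomly chosen negative (the Mann-Whitney formulation of the area under the ROC curve); a minimal sketch for small score lists:

```python
def auroc(scores, labels):
    """AUROC via the rank-sum formulation: the fraction of
    positive/negative pairs in which the positive outscores the
    negative, counting ties as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

area = auroc([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])  # perfectly separated scores
```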
The performance of models based on different machine learning techniques on the test dataset when combined with BLAST hits at e-value 10^-3.
| Feature | Classifier | Parameters | Sens (%) | Spec (%) | Acc (%) | AUROC | MCC |
| PSSM | RF | ne = 80 | 83.24 | 96.72 | 91.39 | 0.95 | 0.82 |
| AAC | RF | ne = 100 | 82.12 | 94.53 | 89.62 | 0.92 | 0.78 |
| AAC+PSSM | ET | ne = 100 | 87.15 | 89.78 | 88.74 | 0.95 | 0.77 |
| DPC | SVC | C = 2, g = 0.01, k = rbf | 79.89 | 92.34 | 87.42 | 0.93 | 0.73 |
g, gamma; ne, n_estimators; k, kernel.
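The hybrid scheme behind this table can be sketched as: use the BLAST label whenever the query has a hit at the chosen e-value cutoff, and fall back to the machine learning model's probability for no-hit queries. The function signature, the dictionary format for BLAST results, and the 0.5 decision threshold are assumptions for illustration, not the paper's exact implementation.

```python
def hybrid_predict(query, blast_hits, ml_prob, threshold=0.5):
    """Hybrid BLAST + ML prediction: trust the similarity search when it
    found a hit (blast_hits maps query id -> label implied by the hit);
    otherwise fall back to the ML model's PRR probability."""
    if query in blast_hits:           # similarity search found a hit
        return blast_hits[query]
    return ml_prob >= threshold       # ML fallback for no-hit queries

hit_pred = hybrid_predict("q1", {"q1": True}, ml_prob=0.2)   # BLAST label wins
fallback = hybrid_predict("q2", {"q1": True}, ml_prob=0.8)   # ML decides
```

This is how the hybrid models recover the no-hit queries that limited the pure BLAST approach while keeping the high precision of similarity search when a hit exists.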