Varun Khanna, Lei Li, Johnson Fung, Shoba Ranganathan, Nikolai Petrovsky.
Abstract
BACKGROUND: Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Because ODNs contain a considerable number of rotatable bonds, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists from features including the count and position of motifs, the distance between motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house, experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling.
Keywords: CpG; Imbalanced data; Machine learning; Oligonucleotides; Random Forest; Toll-like receptor 9
Year: 2019 PMID: 31856726 PMCID: PMC6924143 DOI: 10.1186/s12860-019-0241-0
Source DB: PubMed Journal: BMC Mol Cell Biol ISSN: 2661-8850
Fig. 1 Top 20 motifs in mTLR9 active ODNs, arranged clockwise based on the absolute difference in the percentage of occurrence in the high and low activity groups of ODNs. The width of the ribbon shows the average percent composition of the motifs in each group
Fig. 2 The effect of the top 20 motifs in the high (a) and low (b) mTLR9 activity groups of ODNs in the dataset. The darker bars represent a significant difference in the median mTLR9 activity score due to the presence of the motif in the ODNs. The dotted line shows the median mTLR9 activity of 0.53 and 0.18 for the ODNs in the high and low activity groups, respectively, in the dataset
Fig. 3 Mean and standard deviation of the balanced accuracy rates of the five classifiers on the twenty bootstrap test samples using a k-fold cross-validation scheme. The mean balanced accuracy rate of the RF model was greater than that of the other four algorithms in all the folds
Mean and standard deviation (SD) values of the balanced accuracy and Matthews Correlation Coefficient (MCC) for all five learning algorithms in 20 bootstrap test samples. The best values in each fold category are underlined with the overall best in bold
| Algorithm | Cross-validation | Mean balanced accuracy | SD balanced accuracy | Mean MCC | SD MCC |
|---|---|---|---|---|---|
| RF | 5-fold | | 0.08 | | 0.15 |
| GBM | 5-fold | 76.8% | 0.07 | 0.55 | 0.12 |
| SDA | 5-fold | 74.6% | 0.08 | 0.50 | 0.14 |
| SVM | 5-fold | 77.1% | 0.08 | 0.55 | 0.16 |
| NN | 5-fold | 74.1% | 0.07 | 0.50 | 0.13 |
| RF | 10-fold | | 0.06 | | 0.11 |
| GBM | 10-fold | 77.7% | 0.05 | 0.57 | 0.10 |
| SDA | 10-fold | 75.8% | 0.06 | 0.53 | 0.11 |
| SVM | 10-fold | 78.4% | 0.05 | 0.58 | 0.11 |
| NN | 10-fold | 72.9% | 0.05 | 0.48 | 0.10 |
| RF | 15-fold | | 0.06 | | 0.11 |
| GBM | 15-fold | 76.9% | 0.06 | | 0.11 |
| SDA | 15-fold | 73.5% | 0.06 | 0.49 | 0.11 |
| SVM | 15-fold | 76.3% | 0.05 | 0.53 | 0.11 |
| NN | 15-fold | 72.6% | 0.07 | 0.47 | 0.15 |
| GBM | 20-fold | 78.5% | 0.07 | 0.58 | 0.12 |
| SDA | 20-fold | 76.1% | 0.08 | 0.54 | 0.14 |
| SVM | 20-fold | 75.4% | 0.05 | 0.52 | 0.09 |
| NN | 20-fold | 74.9% | 0.07 | 0.52 | 0.13 |
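The two metrics reported in the table above can be computed directly from a confusion matrix. The following is a minimal sketch in plain Python, not the authors' code; the function names and the example counts (a 46-ODN test set with 23 high and 23 low ODNs) are hypothetical and chosen only for illustration.

```python
import math

def confusion_counts(y_true, y_pred, positive="high"):
    """Count TP, FP, TN, FN for a binary problem; `positive` is the label of interest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity; both classes are assumed to be present."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical counts for one test sample (23 high, 23 low ODNs).
tp, fp, tn, fn = 18, 6, 17, 5
print(f"balanced accuracy = {balanced_accuracy(tp, fp, tn, fn):.3f}")
print(f"MCC = {mcc(tp, fp, tn, fn):.3f}")
```

Balanced accuracy weights both classes equally, and MCC uses all four cells of the confusion matrix, so neither rewards a classifier for simply predicting the more frequent class.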
Fig. 4 Measured mTLR9 activity values of the 100 top predicted TLR9 active ODNs. The dotted black line is the cutoff value for the ODNs in the high activity group used in building the model
Fig. 5 The measured mTLR9 activity value of all the synthesized 24-mer ODNs in the dataset. The ODNs were divided into two groups of high (shown in purple) and low (shown in green) activity using a cutoff score of 0.4, based on the optical density (OD) results from the Raw-blue reporter cell assay
Fig. 6 Flowchart of the methodology adopted
Composition of the training and test sets at any instance
| Dataset | Training set | Testing set | Total |
|---|---|---|---|
| High | 94 | 23 | 117 |
| Low | 94 | 23 | 117 |
| Total | 188 | 46 | 234 |
| Prediction set | _ | _ | 6000 |
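The balanced composition above can be produced by repeated random down-sampling of the larger class. The sketch below illustrates that idea rather than reproducing the authors' pipeline. It assumes the 396 ODNs split into 117 high and 279 low activity ODNs (inferred from the table, not stated explicitly) and uses 20 repeats to match the twenty test samples mentioned in the Fig. 3 caption. The function names, placeholder ODN identifiers and seed are hypothetical.

```python
import random

def balanced_split(high, low, n_train=94, rng=None):
    """Down-sample the larger class to the size of the smaller one,
    then split each class into training and test parts."""
    if rng is None:
        rng = random.Random()
    n = min(len(high), len(low))
    hi = rng.sample(high, n)   # random order, no replacement
    lo = rng.sample(low, n)    # random subset of the majority class
    train = [(x, "high") for x in hi[:n_train]] + [(x, "low") for x in lo[:n_train]]
    test = [(x, "high") for x in hi[n_train:]] + [(x, "low") for x in lo[n_train:]]
    return train, test

def repeated_downsampling(high, low, n_repeats=20, seed=0):
    """Draw several independent balanced train/test splits."""
    rng = random.Random(seed)
    return [balanced_split(high, low, rng=rng) for _ in range(n_repeats)]

# Placeholder identifiers standing in for real ODN sequences (assumption).
high = [f"H{i}" for i in range(117)]
low = [f"L{i}" for i in range(279)]

splits = repeated_downsampling(high, low)
train, test = splits[0]
print(len(train), len(test))  # sizes of the first balanced split
```

A classifier would then be trained on each training part, with the per-split predictions or scores combined to form the ensemble.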
Features used in this study
| S.no | Feature | Description | Type |
|---|---|---|---|
| 1 | A | Count of A nucleotides | Numerical |
| 2 | T | Count of T nucleotides | Numerical |
| 3 | G | Count of G nucleotides | Numerical |
| 4 | C | Count of C nucleotides | Numerical |
| 5 | d_CG2_1 | Distance between occurrences 2 and 1 of CG motif | Numerical |
| 6 | d_CG3_1 | Distance between occurrences 3 and 1 of CG motif | Numerical |
| 7 | d_CG3_2 | Distance between occurrences 3 and 2 of CG motif | Numerical |
| 8 | d_AG2_1 | Distance between occurrences 2 and 1 of AG motif | Numerical |
| 9 | d_AG3_1 | Distance between occurrences 3 and 1 of AG motif | Numerical |
| 10 | d_AG3_2 | Distance between occurrences 3 and 2 of AG motif | Numerical |
| 11 | d_GG2_1 | Distance between occurrences 2 and 1 of GG motif | Numerical |
| 12 | d_GG3_1 | Distance between occurrences 3 and 1 of GG motif | Numerical |
| 13 | d_GG3_2 | Distance between occurrences 3 and 2 of GG motif | Numerical |
| 14 | d_CC2_1 | Distance between occurrences 2 and 1 of CC motif | Numerical |
| 15 | d_CC3_1 | Distance between occurrences 3 and 1 of CC motif | Numerical |
| 16 | d_CC3_2 | Distance between occurrences 3 and 2 of CC motif | Numerical |
| 17 | d_TCT2_1 | Distance between occurrences 2 and 1 of TCT motif | Numerical |
| 18 | d_TCT3_1 | Distance between occurrences 3 and 1 of TCT motif | Numerical |
| 19 | d_TCT3_2 | Distance between occurrences 3 and 2 of TCT motif | Numerical |
| 20 | d_TTC2_1 | Distance between occurrences 2 and 1 of TTC motif | Numerical |
| 21 | d_TTC3_1 | Distance between occurrences 3 and 1 of TTC motif | Numerical |
| 22 | d_TTC3_2 | Distance between occurrences 3 and 2 of TTC motif | Numerical |
| 23 | d_TGT2_1 | Distance between occurrences 2 and 1 of TGT motif | Numerical |
| 24 | d_TGT3_1 | Distance between occurrences 3 and 1 of TGT motif | Numerical |
| 25 | d_TGT3_2 | Distance between occurrences 3 and 2 of TGT motif | Numerical |
| 26 | PMI1 | Principal Moment of Inertia 1 | Numerical |
| 27 | PMI2 | Principal Moment of Inertia 2 | Numerical |
| 28 | Mu_x | Center of mass in x direction | Numerical |
| 29 | Mu_y | Center of mass in y direction | Numerical |
| 30 | CG1 | Presence of CG at position 1 | Fingerprint |
| 31 | GC1 | Presence of GC at position 1 | Fingerprint |
| 32 | GT1 | Presence of GT at position 1 | Fingerprint |
| 33 | GT18 | Presence of GT at position 18 | Fingerprint |
| 34 | GCG6 | Presence of GCG at position 6 | Fingerprint |
| 35 | GT22 | Presence of GT at position 22 | Fingerprint |
| 36 | GT21 | Presence of GT at position 21 | Fingerprint |
| 37 | CGCG5 | Presence of CGCG at position 5 | Fingerprint |
| 38 | GC5 | Presence of GC at position 5 | Fingerprint |
| 39 | GT12 | Presence of GT at position 12 | Fingerprint |
| 40 | TC9 | Presence of TC at position 9 | Fingerprint |
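The sequence-derived features in the table (nucleotide counts, distances between motif occurrences and positional fingerprints) can be sketched as below. Several conventions here are assumptions, not details given in the source: positions are 1-based, overlapping motif occurrences are counted, distance is the difference between start positions, and a distance is set to 0 when the ODN has too few occurrences of the motif. The graphically derived features (PMI1, PMI2, Mu_x, Mu_y) are omitted because their construction is not described here. The example 24-mer is hypothetical.

```python
DISTANCE_MOTIFS = ("CG", "AG", "GG", "CC", "TCT", "TTC", "TGT")
FINGERPRINTS = (("CG", 1), ("GC", 1), ("GT", 1), ("GT", 18), ("GCG", 6), ("GT", 22),
                ("GT", 21), ("CGCG", 5), ("GC", 5), ("GT", 12), ("TC", 9))

def motif_positions(seq, motif):
    """1-based start positions of every (possibly overlapping) occurrence."""
    return [i + 1 for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]

def motif_distance(seq, motif, later, earlier, missing=0):
    """Distance between the `later`-th and `earlier`-th occurrence of a motif.
    Returns `missing` (an assumed convention) if there are too few occurrences."""
    pos = motif_positions(seq, motif)
    if len(pos) < later:
        return missing
    return pos[later - 1] - pos[earlier - 1]

def has_motif_at(seq, motif, position):
    """1 if `motif` starts at the given 1-based position, else 0."""
    return int(seq.startswith(motif, position - 1))

def extract_features(seq):
    """Build the count, distance and fingerprint features for one ODN sequence."""
    feats = {base: seq.count(base) for base in "ATGC"}
    for motif in DISTANCE_MOTIFS:
        for later, earlier in ((2, 1), (3, 1), (3, 2)):
            feats[f"d_{motif}{later}_{earlier}"] = motif_distance(seq, motif, later, earlier)
    for motif, position in FINGERPRINTS:
        feats[f"{motif}{position}"] = has_motif_at(seq, motif, position)
    return feats

# Hypothetical 24-mer used only to exercise the functions.
feats = extract_features("CGTCGTTTTGTCGTTTTGTCGTTA")
print(feats["d_CG2_1"], feats["d_CG3_1"], feats["CG1"], feats["GT18"])
```

Returning a dictionary keyed by the feature names in the table keeps each value traceable to its row; the dictionaries can be stacked into a feature matrix before model training.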