Ching-Hsuan Chien1, Lan-Ying Huang1, Shuen-Fang Lo2, Liang-Jwu Chen3,4, Chi-Chou Liao3, Jia-Jyun Chen5, Yen-Wei Chu1,2,3,5,6,7,8.
Abstract
Inserting T-DNA into the genome to change the expression of flanking genes is a common approach in rice functional gene research. However, whether the expression of a gene of interest is enhanced must be validated experimentally. To improve the efficiency of screening for activated genes, we established a model that predicts gene expression in T-DNA mutants using machine learning methods. We gathered experimental datasets consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In the first layer, support vector machine (SVM) models were constructed with nine features covering biological function and local and global sequence information. Feature encoding based on the PROMOTER sequence was weighted by logistic regression. The second layer integrated 16 first-layer models using minimum redundancy maximum relevance (mRMR) feature selection and the LADTree algorithm, which were selected from nine feature selection methods and 65 classification methods, respectively. The accuracy of the final two-layer machine learning model, referred to as TIMgo, was 99.3% based on fivefold cross-validation and 85.6% based on independent testing. We found that local sequence information contributed more to classification than global sequence information. TIMgo had good predictive ability for target genes within 20 kb of the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the activation mechanism of the 35S enhancer.
Keywords: CaMV 35S enhancer; T-DNA activation tagging; gene expression; machine learning; rice
Year: 2021 PMID: 34976025 PMCID: PMC8718795 DOI: 10.3389/fgene.2021.798107
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Data distribution of flanking analyzed genes in rice T-DNA mutants.
| Data source | Number of mutant lines | Expression status: Ac | Expression status: NE | Expression status: ND | Expression status: Ko | Validated genes |
|---|---|---|---|---|---|---|
| NCHU | 11 | 26 | 22 | 17 | 0 | 65 |
| Academia Sinica | 316 | 262 | 143 | 13 | 2 | 420 |
| Total | 327 | 288 | 165 | 30 | 2 | 485 |
Ac, activated gene; NE, nonactivated gene; ND, non-detectable gene; Ko, knockout gene.
Validated genes indicate the target genes that were detected by RT-PCR.
NCHU, experimental data were collected from Liang-Jwu Chen’s laboratory.
Academia Sinica, experimental data were collected by Su-May Yu’s research team.
Data distribution of the training dataset and independent-testing dataset.
| Data sources | Training dataset (D300): Ac | Training dataset (D300): NAc | Testing dataset (D153): Ac | Testing dataset (D153): NAc |
|---|---|---|---|---|
| NCHU | 20 | 20 | 6 | 2 |
| Academia Sinica | 130 | 130 | 132 | 13 |
| Total | 150 | 150 | 138 | 15 |
FIGURE 1. Flow chart of the TIMgo predictive system.
FIGURE 2. Correlation between distance and gene activation. The data were sorted by the distance between the 35S enhancer and the TLS, and the ratio of Ac to NAc genes in each group was calculated. The x-axis is the distance from the 35S enhancer to the TLS of a target gene; the y-axis is the proportion of Ac and NAc genes in each group.
Performance of the Kmer and RevKmer features with and without motif.
| Feature | k | Without motif: Sp (%) | Without motif: Sn (%) | Without motif: Acc (%) | Without motif: MCC (%) | Without motif: AUC (%) | With motif: Sp (%) | With motif: Sn (%) | With motif: Acc (%) | With motif: MCC (%) | With motif: AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Kmer | 6 | 72.7 | 66.0 | 69.3 | 38.8 | 79.0 | 79.3 | 77.3 | 78.3 | 56.7 | 88.1 |
| Kmer | 7 | 86.7 | 73.3 | 80.0 | 60.5 | 89.1 | 83.3 | 78.7 | 81.0 | 62.1 | 89.7 |
| Kmer | 8 | 75.3 | 35.3 | 55.3 | 11.6 | 65.3 | 83.3 | 84.7 | 84.0 | 68.0 | 93.6 |
| Kmer | 9 | 84.7 | 85.3 | 85.0 | 70.0 | 93.2 | 86.7 | 85.3 | 86.0 | 72.0 | 93.7 |
| RevKmer | 6 | 71.3 | 60.7 | 66.0 | 32.2 | 72.7 | 78.0 | 77.3 | 77.7 | 55.3 | 85.7 |
| RevKmer | 7 | 84.7 | 76.0 | 80.3 | 60.9 | 87.9 | 79.3 | 77.3 | 78.3 | 56.7 | 88.1 |
| RevKmer | 8 | 77.3 | 32.7 | 55.0 | 11.2 | 64.9 | 84.0 | 80.0 | 82.0 | 64.1 | 91.5 |
| RevKmer | 9 | 74.7 | 88.0 | 81.3 | 63.2 | 90.6 | 84.0 | 84.7 | 84.3 | 68.7 | 92.9 |
k refers to the maximum k value used in Kmer and RevKmer; each analysis covers subsequences of 3 to k nucleotides in length.
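To make the footnote concrete, the sketch below counts every overlapping subsequence of 3 to k nucleotides (Kmer) and then merges each subsequence with its reverse complement (RevKmer). This is only an illustration under stated assumptions: the function names, the use of raw counts rather than normalized frequencies, and the handling of ambiguous bases are hypothetical choices, and the source does not describe the authors' actual encoding code.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, max_k, min_k=3):
    """Count every overlapping subsequence of length min_k..max_k.

    Assumption: words containing anything other than A, C, G, T
    (for example N) are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for k in range(min_k, max_k + 1):
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def revkmer_counts(seq, max_k, min_k=3):
    """Merge each k-mer with its reverse complement under one key.

    Assumption: the lexicographically smaller of the two strings
    is used as the shared (canonical) key.
    """
    merged = Counter()
    for word, n in kmer_counts(seq, max_k, min_k).items():
        merged[min(word, reverse_complement(word))] += n
    return merged


if __name__ == "__main__":
    example = "ACGTAC"  # toy sequence, not from the study
    print(dict(kmer_counts(example, 3)))
    print(dict(revkmer_counts(example, 3)))
```

Merging reverse complements roughly halves the feature space, which is one plausible reason to evaluate both encodings side by side.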
Performance of the first-layer features with the SVM models.
| Feature encoding | Sequence | Cross-validation: Sp (%) | Cross-validation: Sn (%) | Cross-validation: Acc (%) | Cross-validation: MCC (%) | Cross-validation: AUC (%) | Independent testing: Sp (%) | Independent testing: Sn (%) | Independent testing: Acc (%) | Independent testing: MCC (%) | Independent testing: AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CGIs | PROMOTER | 71.3 | 48.7 | 60.0 | 20.5 | 58.5 | 53.3 | 40.6 | 41.8 | −3.7 | 48.2 |
| CGIs | MIDDLE | 77.3 | 18.0 | 47.7 | −5.8 | 47.2 | 100.0 | 2.2 | 11.8 | 4.7 | 65.0 |
| DNP | PROMOTER | 56.0 | 64.7 | 60.3 | 20.7 | 64.3 | 26.7 | 71.7 | 67.3 | −1.1 | 45.1 |
| DNP | MIDDLE | 59.3 | 62.0 | 60.7 | 21.3 | 60.0 | 60.0 | 53.6 | 54.3 | 8.1 | 48.7 |
| TNP | PROMOTER | 56.0 | 61.3 | 58.7 | 17.4 | 62.2 | 53.3 | 68.1 | 66.7 | 13.5 | 57.4 |
| TNP | MIDDLE | 64.7 | 30.0 | 47.3 | −5.7 | 47.4 | 26.7 | 65.9 | 62.1 | −4.7 | 45.0 |
| Kmer + Motif | PROMOTER | 86.7 | 85.3 | 86.0 | 72.0 | 93.7 | 73.3 | 85.5 | 84.3 | 43.5 | 79.1 |
| RevKmer + Motif | PROMOTER | 84.0 | 84.7 | 84.3 | 68.7 | 92.9 | 73.3 | 81.2 | 80.4 | 37.8 | 83.6 |
| Kmer | MIDDLE | 92.0 | 84.7 | 88.3 | 76.9 | 94.2 | 66.7 | 86.2 | 84.3 | 40.1 | 86.4 |
| RevKmer | MIDDLE | 85.3 | 72.7 | 79.0 | 58.5 | 88.2 | 53.3 | 68.8 | 67.3 | 14.0 | 66.5 |
| DACC | PROMOTER | 67.1 | 72.7 | 69.9 | 39.8 | 78.6 | 46.7 | 59.4 | 58.2 | 3.7 | 54.6 |
| DACC | MIDDLE | 76.5 | 58.0 | 67.2 | 35.1 | 74.1 | 53.3 | 49.3 | 49.7 | 1.6 | 47.5 |
| TACC | PROMOTER | 60.4 | 58.0 | 59.2 | 18.4 | 60.3 | 13.3 | 63.0 | 58.2 | −14.8 | 41.6 |
| TACC | MIDDLE | 59.7 | 56.7 | 58.2 | 16.4 | 57.8 | 46.7 | 45.7 | 45.8 | −4.6 | 45.1 |
| PseKNC | PROMOTER | 89.9 | 60.7 | 75.3 | 52.9 | 84.5 | 73.3 | 54.3 | 56.2 | 16.5 | 59.1 |
| PseKNC | MIDDLE | 56.4 | 52.7 | 59.5 | 19.1 | 61.7 | 66.7 | 58.0 | 58.8 | 14.7 | 54.5 |
FIGURE 3. Accuracy trend in the second-layer feature selection.
FIGURE 4. Accuracy trend of TIMgo for cross-validation and independent testing of data within different distances. Train represents the Acc from fivefold cross-validation with D299. Test represents the Acc from independent testing with D153. The x-axis indicates each distance interval, and the y-axis indicates the predictive accuracy.
Performance of the LADTree model in the second-layer.
| Evaluation | TP | FP | TN | FN | Sn (%) | Sp (%) | Acc (%) | MCC (%) |
|---|---|---|---|---|---|---|---|---|
| Cross-validation | 149 | 1 | 148 | 1 | 99.3 | 99.3 | 99.3 | 98.7 |
| Independent testing | 123 | 7 | 8 | 15 | 89.1 | 53.3 | 85.6 | 35.3 |
Predictive accuracy of TIMgo for different distance groups (distance from the 35S enhancer, in kb).
| Dataset | 0–2 | 2–5 | 5–10 | 10–15 | 15–20 | 20–25 | >25 |
|---|---|---|---|---|---|---|---|
| Training set | 100.0% | 100.0% | 100.0% | 97.0% | 100.0% | 100.0% | 100.0% |
| Testing set | 89.0% | 91.0% | 84.0% | 86.0% | 93.0% | 71.0% | 60.0% |
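A per-distance breakdown like the one above can be produced by assigning each target gene to a distance interval and computing accuracy within each interval. The sketch below is a minimal illustration: the record layout `(distance_kb, actual, predicted)`, the function names, and the choice to treat each interval's upper edge as inclusive are all assumptions, and the example records are made up rather than taken from the study.

```python
from bisect import bisect_left

# Upper edges (kb) of the first six intervals; anything larger falls in ">25".
BIN_EDGES_KB = [2, 5, 10, 15, 20, 25]
BIN_LABELS = ["0-2", "2-5", "5-10", "10-15", "15-20", "20-25", ">25"]


def distance_bin(distance_kb):
    """Map a distance in kb to its interval label.

    Assumption: a value equal to an edge belongs to the lower interval,
    so 2.0 kb is placed in "0-2".
    """
    return BIN_LABELS[bisect_left(BIN_EDGES_KB, distance_kb)]


def accuracy_by_distance(records):
    """Return accuracy (%) per interval for (distance_kb, actual, predicted) records.

    Intervals without any records are omitted from the result.
    """
    hits = {label: 0 for label in BIN_LABELS}
    totals = {label: 0 for label in BIN_LABELS}
    for distance_kb, actual, predicted in records:
        label = distance_bin(distance_kb)
        totals[label] += 1
        hits[label] += int(actual == predicted)
    return {
        label: 100 * hits[label] / totals[label]
        for label in BIN_LABELS
        if totals[label]
    }


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    toy_records = [
        (1.0, "Ac", "Ac"),
        (1.5, "Ac", "NAc"),
        (7.0, "NAc", "NAc"),
        (30.0, "Ac", "NAc"),
    ]
    print(accuracy_by_distance(toy_records))
```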
Comparison of TIMgo and EAT-Rice with independent-testing evaluation.
| System | Subset1: Sp (%) | Subset1: Sn (%) | Subset1: Acc (%) | Subset1: AUC (%) | Subset2: Sp (%) | Subset2: Sn (%) | Subset2: Acc (%) | Subset2: AUC (%) |
|---|---|---|---|---|---|---|---|---|
| EAT-Rice | 59.1 | 84.6 | 72.9 | 79.4 | 59.1 | 92.3 | 77.1 | 83.2 |
| TIMgo | 72.7 | 84.6 | 79.2 | 87.4 | 78.3 | 76.7 | 77.6 | 84.4 |