Chi-Chou Liao, Liang-Jwu Chen, Shuen-Fang Lo, Chi-Wei Chen, Yen-Wei Chu.
Abstract
T-DNA activation-tagging technology is widely used to study rice gene functions. When a T-DNA inserts into the genome, its CaMV 35S enhancer may alter the expression of flanking genes, but the affected genes still need to be validated by biological experiments. We have developed the EAT-Rice platform to predict the expression of genes flanking the T-DNA insertion site in rice mutants. Three kinds of DNA sequences, UPS1K, DISTANCE, and MIDDLE, were retrieved and encoded to build a two-layer machine-learning prediction model. In the first layer, four features, nucleotide context (N-gram), cis-regulatory elements (Motif), nucleotide physicochemical properties (NPC), and CG-island (CGI), were used to build SVM models that capture the information embedded in the three kinds of sequences. Logistic regression was used to estimate the probability of gene activation, which served as a feature-encoding weight within the first-layer models. In the second layer, the NaiveBayesUpdateable algorithm was used to integrate the first-layer models; the final system performance was 88.33% on 5-fold cross-validation and 79.17% on independent testing. Among the three kinds of sequences, the model built from MIDDLE contributed most to identifying activated genes. Compared with the TRIM database, the EAT-Rice system performed better and predicted gene expression at greater distances. An online server based on EAT-Rice is available at http://predictor.nchu.edu.tw/EAT-Rice.
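As a rough illustration of what an N-gram (nucleotide context) encoding can look like, the Python sketch below turns a DNA sequence into a fixed-length vector of n-gram frequencies. The function name `ngram_frequencies`, the choice of n = 2, and the skipping of ambiguous bases are assumptions made for illustration; the abstract does not specify how EAT-Rice computes its N-gram features.

```python
from collections import Counter
from itertools import product


def ngram_frequencies(seq, n=2):
    """Return the relative frequency of every length-n nucleotide word in seq.

    Hypothetical helper for illustration only, not the EAT-Rice encoder.
    """
    seq = seq.upper()
    # Slide a window of width n over the sequence.
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    # Count only words made purely of A/C/G/T (ambiguous bases are skipped).
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    # A fixed key order yields a fixed-length vector with 4**n entries.
    words = ["".join(p) for p in product("ACGT", repeat=n)]
    return {w: (counts[w] / total if total else 0.0) for w in words}


features = ngram_frequencies("ACGTACGT", n=2)
```

A vector of this kind, computed separately for each of the three sequence regions, is the sort of input that a first-layer classifier such as an SVM could be trained on.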
Year: 2019 PMID: 31067213 PMCID: PMC6505892 DOI: 10.1371/journal.pcbi.1006942
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Data distribution of flanking genes in rice T-DNA mutants.
| Data Sources | Mutant Line | Gene Expression State: Ac | Gene Expression State: NE | Gene Expression State: ND | Validated Genes |
|---|---|---|---|---|---|
| TDNA-DS1 | 226 | 190 | 90 | 13 | 293 |
| TDNA-DS2 | 11 | 26 | 22 | 17 | 65 |
| Sum | 237 | 216 | 112 | 30 | 358 |
a Validated genes are flanking genes whose expression in T-DNA mutants was detected by RT-PCR.
b TDNA-DS1 is the first collected dataset.
c TDNA-DS2 is the second collected dataset.
Fig 1. Illustration of the three kinds of sequence information used in EAT-Rice construction.
The first region (slanted box) is UPS1K, the second region (curly bracket) is DISTANCE, and the third region (double-headed arrow) is MIDDLE. The coding domain sequence of the target gene (Gene CDS) is shown as a grayish-white box.
Fig 2. Flow chart of the system architecture.
The dotted square encloses the two-layer model construction. The solid and dotted circles drawn around the four kinds of features in the 2nd Layer Modules indicate the feature-combination mechanism.
Fig 3. Correlation analysis of enhancer properties and the activation ratio of genes.
Four properties of the interaction between the enhancer and the target gene are summarized. (A) Distance from the 35S enhancer at the T-DNA insertion site to the TLS of the gene. (B) Gene orientation. (C) Orientation of the T-DNA insertion (enhancer's orientation). (D) Location of the T-DNA insertion (enhancer's location). US (upstream): the T-DNA inserts upstream of the target gene; DS (downstream): the T-DNA inserts downstream of the target gene; IG (intragenic): the T-DNA inserts within the target gene.
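The location categories in panel (D) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the 1-based forward-strand coordinate convention, and the use of the gene's 5' end as a stand-in for the TLS are all hypothetical and are not taken from the EAT-Rice implementation.

```python
def insertion_location(insert_pos, gene_start, gene_end, strand):
    """Classify a T-DNA insertion as US, DS, or IG relative to one gene.

    Assumes forward-strand coordinates with gene_start <= gene_end and
    strand given as "+" or "-" (hypothetical convention).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if gene_start <= insert_pos <= gene_end:
        return "IG"  # inside the gene interval
    before_gene = insert_pos < gene_start  # lower coordinate than the gene
    # On "+" lower coordinates are upstream; on "-" they are downstream.
    if (strand == "+") == before_gene:
        return "US"
    return "DS"


def distance_to_five_prime_end(insert_pos, gene_start, gene_end, strand):
    """Absolute distance from the insertion to the gene's 5' end.

    The caption measures distance to the TLS; the 5' end of the gene
    interval is used here only as a simplifying assumption.
    """
    five_prime = gene_start if strand == "+" else gene_end
    return abs(insert_pos - five_prime)
```

For example, an insertion at position 500 next to a gene spanning 1000 to 2000 would be US for a "+" strand gene but DS for a "-" strand gene, which is why gene orientation and insertion location are treated as separate properties.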
Evaluation of the first-layer SVM feature models.
| Feature Encoding | Sequence | Cross-Validation | | | | | Independent-Testing | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| N-gram | UPS1K | 90.00 | 0.804 | 0.900 | 0.900 | 0.900 | 64.58 | 0.698 | 0.622 | 0.538 | 0.773 |
| N-gram | DISTANCE | 53.89 | 0.555 | 0.484 | 0.433 | 0.644 | 64.58 | 0.661 | 0.585 | 0.462 | 0.864 |
| N-gram | MIDDLE | 95.00 | 0.980 | 0.950 | 0.956 | 0.944 | 72.92 | 0.815 | 0.772 | 0.846 | 0.591 |
| N-gram | Overall | 79.63 | 0.780 | 0.778 | 0.763 | 0.829 | 67.36 | 0.725 | 0.660 | 0.615 | 0.743 |
| NPC | UPS1K | 56.11 | 0.538 | 0.633 | 0.755 | 0.367 | 60.42 | 0.780 | 0.537 | 0.423 | 0.818 |
| NPC | DISTANCE | 50.00 | 0.486 | 0.536 | 0.578 | 0.422 | 54.17 | 0.528 | 0.645 | 0.769 | 0.273 |
| NPC | MIDDLE | 59.44 | 0.621 | 0.610 | 0.634 | 0.555 | 68.75 | 0.780 | 0.667 | 0.577 | 0.818 |
| NPC | Overall | 55.18 | 0.548 | 0.593 | 0.656 | 0.448 | 61.11 | 0.696 | 0.616 | 0.590 | 0.636 |
| Motif | UPS1K | 82.22 | 0.879 | 0.826 | 0.844 | 0.800 | 50.00 | 0.490 | 0.571 | 0.615 | 0.364 |
| CGI | UPS1K | 51.67 | 0.526 | 0.62 | 0.789 | 0.245 | 50.00 | 0.439 | 0.613 | 0.731 | 0.227 |

| Feature Encoding | Sequence | Cross-Validation | | | | | Independent-Testing | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| N-gram | UPS1K | 81.11 | 0.888 | 0.811 | 0.811 | 0.811 | 70.83 | 0.638 | 0.759 | 0.846 | 0.545 |
| N-gram | DISTANCE | 61.67 | 0.613 | 0.615 | 0.611 | 0.622 | 58.33 | 0.743 | 0.444 | 0.308 | 0.909 |
| N-gram | MIDDLE | 89.44 | 0.940 | 0.897 | 0.922 | 0.867 | 70.83 | 0.823 | 0.781 | 0.962 | 0.410 |
| N-gram | Overall | 77.41 | 0.814 | 0.774 | 0.781 | 0.767 | 66.66 | 0.735 | 0.661 | 0.705 | 0.621 |
| NPC | UPS1K | 53.89 | 0.535 | 0.638 | 0.811 | 0.267 | 56.25 | 0.669 | 0.571 | 0.538 | 0.591 |
| NPC | DISTANCE | 61.67 | 0.627 | 0.623 | 0.633 | 0.600 | 54.17 | 0.675 | 0.421 | 0.308 | 0.818 |
| NPC | MIDDLE | 48.33 | 0.509 | 0.546 | 0.622 | 0.345 | 70.83 | 0.743 | 0.708 | 0.654 | 0.773 |
| NPC | Overall | 54.63 | 0.557 | 0.602 | 0.689 | 0.404 | 60.42 | 0.696 | 0.567 | 0.500 | 0.727 |
| Motif | UPS1K | 79.44 | 0.844 | 0.798 | 0.811 | 0.778 | 64.58 | 0.661 | 0.691 | 0.731 | 0.545 |
| CGI | UPS1K | 49.44 | 0.471 | 0.480 | 0.466 | 0.522 | 41.67 | 0.484 | 0.588 | 0.769 | 0.000 |
a Overall indicates the average performance of the models built from the UPS1K, DISTANCE, and MIDDLE sequences.
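The Overall rows can be checked by simple arithmetic. The short Python sketch below averages the first cross-validation value of the three N-gram models from the first block of the table (90.00, 53.89, and 95.00) and reproduces the 79.63 reported in that block's Overall row. The helper name `overall` and the rounding to two decimals are illustrative choices, not part of the source.

```python
# First cross-validation value of the three N-gram models (first table block).
ngram_values = {"UPS1K": 90.00, "DISTANCE": 53.89, "MIDDLE": 95.00}


def overall(values):
    """Arithmetic mean of the per-sequence values, rounded to two decimals."""
    values = list(values)
    return round(sum(values) / len(values), 2)


overall_value = overall(ngram_values.values())
print(overall_value)  # 79.63, matching the N-gram Overall row
```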
Evaluation of the second-layer combination models built with NaiveBayesUpdateable.
| Pattern of Feature | Cross-Validation | | | | | Independent-Testing | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| N-gram | 95.00 | 0.981 | 0.950 | 0.956 | 0.945 | 72.92 | 0.777 | 0.772 | 0.846 | 0.591 |
| NPC | 56.67 | 0.578 | 0.557 | 0.544 | 0.589 | 58.33 | 0.725 | 0.412 | 0.269 | 0.955 |
| CGI | 50.00 | 0.500 | 0.550 | 0.611 | 0.389 | 50.00 | 0.479 | 0.613 | 0.731 | 0.227 |
| Motif | 82.22 | 0.822 | 0.826 | 0.845 | 0.801 | 50.00 | 0.490 | 0.571 | 0.615 | 0.364 |
| CGI+N-gram | 95.00 | 0.982 | 0.950 | 0.956 | 0.945 | 72.92 | 0.783 | 0.772 | 0.846 | 0.591 |
| CGI+NPC | 50.56 | 0.561 | 0.508 | 0.511 | 0.500 | 58.33 | 0.734 | 0.412 | 0.269 | 0.955 |
| CGI+Motif | 82.22 | 0.822 | 0.826 | 0.845 | 0.801 | 50.00 | 0.484 | 0.571 | 0.615 | 0.364 |
| N-gram+NPC | 95.00 | 0.978 | 0.950 | 0.956 | 0.945 | 72.92 | 0.786 | 0.772 | 0.846 | 0.591 |
| Motif+N-gram | 94.45 | 0.987 | 0.944 | 0.945 | 0.945 | 72.92 | 0.753 | 0.772 | 0.846 | 0.591 |
| Motif+NPC | 82.22 | 0.845 | 0.826 | 0.845 | 0.801 | 50.00 | 0.610 | 0.571 | 0.615 | 0.364 |
| CGI+N-gram+NPC | 95.00 | 0.978 | 0.950 | 0.956 | 0.945 | 72.92 | 0.794 | 0.772 | 0.846 | 0.591 |
| CGI+Motif+N-gram | 94.45 | 0.989 | 0.944 | 0.945 | 0.945 | 72.92 | 0.760 | 0.772 | 0.846 | 0.591 |
| CGI+Motif+NPC | 82.22 | 0.849 | 0.826 | 0.845 | 0.801 | 50.00 | 0.617 | 0.571 | 0.615 | 0.364 |
| Motif+N-gram+NPC | 94.44 | 0.986 | 0.945 | 0.956 | 0.934 | 72.92 | 0.758 | 0.772 | 0.846 | 0.591 |
| CGI+Motif+N-gram+NPC | 94.44 | 0.986 | 0.945 | 0.956 | 0.934 | 72.92 | 0.763 | 0.772 | 0.846 | 0.591 |

| Pattern of Feature | Cross-Validation | | | | | Independent-Testing | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| N-gram | 88.89 | 0.969 | 0.890 | 0.901 | 0.878 | 70.83 | 0.823 | 0.781 | 0.962 | 0.409 |
| NPC | 57.78 | 0.600 | 0.537 | 0.489 | 0.666 | 52.08 | 0.502 | 0.303 | 0.192 | 0.909 |
| CGI | 49.44 | 0.494 | 0.326 | 0.244 | 0.745 | 58.33 | 0.615 | 0.375 | 0.231 | 1.000 |
| Motif | 79.45 | 0.795 | 0.798 | 0.812 | 0.779 | 64.58 | 0.638 | 0.691 | 0.731 | 0.545 |
| CGI+N-gram | 88.89 | 0.967 | 0.890 | 0.901 | 0.878 | 70.83 | 0.841 | 0.781 | 0.962 | 0.409 |
| CGI+NPC | 57.78 | 0.598 | 0.537 | 0.489 | 0.666 | 52.08 | 0.526 | 0.303 | 0.192 | 0.909 |
| CGI+Motif | 79.45 | 0.796 | 0.798 | 0.812 | 0.779 | 64.58 | 0.696 | 0.691 | 0.731 | 0.545 |
| N-gram+NPC | 88.33 | 0.972 | 0.884 | 0.890 | 0.878 | 79.17 | 0.806 | 0.828 | 0.923 | 0.636 |
| Motif+N-gram | 87.78 | 0.975 | 0.872 | 0.834 | 0.922 | 77.08 | 0.841 | 0.814 | 0.923 | 0.591 |
| Motif+NPC | 77.78 | 0.825 | 0.775 | 0.767 | 0.790 | 64.58 | 0.631 | 0.691 | 0.731 | 0.545 |
| CGI+N-gram+NPC | 88.33 | 0.972 | 0.884 | 0.890 | 0.878 | 79.17 | 0.813 | 0.828 | 0.923 | 0.636 |
| CGI+Motif+N-gram | 87.78 | 0.974 | 0.872 | 0.834 | 0.922 | 77.08 | 0.851 | 0.814 | 0.923 | 0.591 |
| CGI+Motif+NPC | 77.78 | 0.823 | 0.775 | 0.767 | 0.790 | 64.58 | 0.644 | 0.691 | 0.731 | 0.545 |
| Motif+N-gram+NPC | 88.33 | 0.978 | 0.879 | 0.846 | 0.922 | 77.08 | 0.830 | 0.814 | 0.923 | 0.591 |
| CGI+Motif+N-gram+NPC | 88.89 | 0.977 | 0.885 | 0.857 | 0.922 | 77.08 | 0.832 | 0.814 | 0.923 | 0.591 |
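The metric sub-headers of the evaluation tables did not survive extraction, so the meaning of each column is not stated here. As a plausibility check only, the Python sketch below computes accuracy, F-measure, sensitivity, and specificity from a confusion matrix. The counts used (24 true positives, 2 false negatives, 14 true negatives, 8 false positives) are hypothetical and are not given in the source; they were chosen because they yield 79.17, 0.828, 0.923, and 0.636, the first, third, fourth, and fifth independent-testing values of the N-gram+NPC row in the second block. Both the column interpretation and the counts are assumptions.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy (%), F-measure, sensitivity, and specificity."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": round(accuracy, 2),
        "f_measure": round(f_measure, 3),
        "sensitivity": round(sensitivity, 3),
        "specificity": round(specificity, 3),
    }


# Hypothetical counts (assumed, not from the source): 26 positives, 22 negatives.
metrics = binary_metrics(tp=24, fn=2, tn=14, fp=8)
```

If this reading is right, the second column in each group would be a separate threshold-independent score, but the source text here does not name it.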
Fig 4. Performance evaluation in different distance ranges.
(A) Assessment of EAT-Rice on different datasets. The values of Train, Test A, and Test B correspond to the left Y axis. Train indicates 5-fold cross-validation of the training model. Test A indicates the performance of the model on the original independent testing data. Test B indicates the performance of the model on new testing data collected after EAT-Rice had been constructed. STDEV (cross-lined bars) is the standard deviation of the three values Train, Test A, and Test B, and corresponds to the right Y axis (STDEV is not available in the 0–2 range). (B) Comparison between EAT-Rice and TRIM. The Y axis shows accuracy.