Caitlin M A Simopoulos, Elizabeth A Weretilnyk, G Brian Golding.
Abstract
BACKGROUND: In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances in discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class is largely understudied in plants. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have previously been used to identify this enigmatic RNA class, with applications largely focused on animal systems. Our approach uses a training set composed only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation.
Keywords: Classifier; Ensemble; Machine learning; Transcript; lncRNA
Year: 2018 PMID: 29720103 PMCID: PMC5930664 DOI: 10.1186/s12864-018-4665-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Negative training data sets in individual models, and corresponding accuracy, sensitivity, specificity and AUC values
| Training dataset | Negative data | AUC (GB) | AUC (RF) | Accuracy (GB) | Accuracy (RF) | Specificity (GB) | Specificity (RF) | Sensitivity (GB) | Sensitivity (RF) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3000, 1000, 3000 | 0.940 | 0.943 | 0.962 | 0.956 | 0.988 | 0.990 | 0.548 | 0.404 |
| 2 | 3000, 3000 | 0.943 | 0.944 | 0.960 | 0.953 | 0.988 | 0.989 | 0.576 | 0.461 |
| 3 | 3000, 1000, 3000 | 0.961 | 0.962 | 0.973 | 0.970 | 0.990 | 0.992 | 0.693 | 0.592 |
| 4 | 3000, 3000 | 0.962 | 0.966 | 0.972 | 0.967 | 0.990 | 0.990 | 0.725 | 0.640 |
| 5 | 3000, 3000 | 0.955 | 0.959 | 0.965 | 0.958 | 0.991 | 0.980 | 0.608 | 0.530 |
| 6 | 4500, 4500 | 0.961 | 0.967 | 0.979 | 0.979 | 0.995 | 0.995 | 0.633 | 0.571 |
| 7 | 3000, 4500 | 0.963 | 0.967 | 0.976 | 0.971 | 0.993 | 0.992 | 0.700 | 0.603 |
| 8 | 2000, 1000, 3000 | 0.964 | 0.965 | 0.968 | 0.965 | 0.988 | 0.990 | 0.695 | 0.619 |
Training datasets of random forest (RF) and gradient boosting (GB) individual models are described. The positive training dataset, 436 validated lncRNAs, remained constant throughout all training datasets. Specificity, sensitivity, accuracy and AUC values were estimated using 10-fold cross-validation of all training data.
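The gap between the high specificity and the much lower sensitivity in this table is easier to read with the definitions in hand. The following is a minimal sketch of how accuracy, sensitivity and specificity follow from a confusion matrix, assuming lncRNAs are the positive class (label 1). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = lncRNA)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def summary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Sensitivity: fraction of true lncRNAs that are recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of negatives that are correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy, imbalanced example (hypothetical data): 4 positives, 16 negatives.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 1, 0, 0] + [1] + [0] * 15
metrics = summary_metrics(y_true, y_pred)
```

Because negatives greatly outnumber positives in each training dataset, accuracy is dominated by specificity, so a model can score well on accuracy while still missing many true lncRNAs.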
Fig. 1 Illustration of ensemble methods. An illustrative example of all four ensemble methods: arithmetic mean, geometric mean, majority vote and the stacking generalizer. Real examples from three different genes are given: gene A represents AT5G44470, a predicted protein; gene B represents At3G09922.1 (IPS1), a known lncRNA; and gene C represents At2G18130.1, a known protein coding gene, AtPAP11. Note the final stacking generalizer score of gene B compared to the individual model scores for the gene.
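The four combination rules named in this caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the 0.5 vote threshold, and the example scores, weights and bias are all assumptions. In a real stacking generalizer the logistic regression weights would be learned from the individual models' scores.

```python
import math


def arithmetic_mean(scores):
    """Average of the individual model scores."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """n-th root of the product of the scores; one low score pulls it down."""
    return math.prod(scores) ** (1.0 / len(scores))


def majority_vote(scores, threshold=0.5):
    """True if more than half of the models score at or above the threshold."""
    votes = sum(1 for s in scores if s >= threshold)
    return votes * 2 > len(scores)


def stacked_score(scores, weights, bias):
    """Logistic regression over model scores (weights and bias are assumed given)."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scores from three individual models for a single transcript.
scores = [0.9, 0.8, 0.4]
mean_score = arithmetic_mean(scores)
geo_score = geometric_mean(scores)
vote = majority_vote(scores)
stack = stacked_score(scores, weights=[1.0, 1.0, 1.0], bias=-1.5)
```

Unlike the two means, the stacked score is not bounded by the individual model scores, which is one way a stacking generalizer can rank a transcript higher or lower than any single model does.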
Gradient boosting hyper-parameters chosen by grid search for each model
| GB model | Learning rate | Max depth | Subsample | Number of estimators |
|---|---|---|---|---|
| 1 | 0.04 | 10 | 0.6 | 100 |
| 2 | 0.04 | 10 | 0.6 | 100 |
| 3 | 0.04 | 10 | 0.6 | 100 |
| 4 | 0.02 | 8 | 0.6 | 100 |
| 5 | 0.02 | 10 | 0.6 | 100 |
| 6 | 0.02 | 10 | 0.6 | 100 |
| 7 | 0.04 | 10 | 0.6 | 100 |
| 8 | 0.04 | 10 | 0.6 | 100 |
Hyper-parameters were chosen by grid search using 30 iterations of 4-fold nested cross-validation. The reported hyper-parameters are those that gave the highest accuracy of all hyper-parameter combinations tested.
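The selection step described here, choosing the combination with the highest accuracy, amounts to an exhaustive grid search. A minimal sketch is given below under several assumptions: the candidate grid is hypothetical (it only contains values that appear in the table, not the grid actually searched), the parameter names follow scikit-learn's `GradientBoostingClassifier` purely for illustration since the library used is not stated here, and `toy_evaluate` is a stand-in for the mean accuracy over repeated 4-fold cross-validation.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid`; return the best parameters and score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. mean cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate grid; the actual search space is not given in the text.
candidate_grid = {
    "learning_rate": [0.02, 0.04],
    "max_depth": [8, 10],
    "subsample": [0.6],
    "n_estimators": [100],
}


def toy_evaluate(params):
    """Placeholder score; a real run would train and cross-validate a model."""
    return (1.0
            - abs(params["learning_rate"] - 0.04)
            - 0.01 * abs(params["max_depth"] - 10))


best_params, best_score = grid_search(candidate_grid, toy_evaluate)
```

In practice `evaluate` would wrap model training and the repeated cross-validation, so the cost of the search grows with the product of the grid sizes.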
Evaluation measures of random forest (RF) and gradient boosting (GB) ensemble models
| ML model type | Ensemble type | AUC | MCC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| RF | Vote | 0.834 | 0.725 | 0.944 | 0.594 | 0.995 |
| RF | Arithmetic mean | 0.963 | 0.661 | 0.941 | 0.562 | 0.996 |
| RF | Geometric mean | 0.963 | 0.706 | 0.941 | 0.555 | 0.997 |
| RF | Logistic regression | 0.835 | 0.765 | 0.952 | 0.665 | 0.994 |
| GB | Vote | 0.887 | 0.797 | 0.958 | 0.702 | 0.995 |
| GB | Arithmetic mean | 0.945 | 0.786 | 0.956 | 0.681 | 0.996 |
| GB | Geometric mean | 0.940 | 0.750 | 0.949 | 0.601 | 0.999 |
| GB | Logistic regression | 0.883 | 0.822 | 0.963 | 0.745 | 0.994 |
Statistics for vote, arithmetic mean, and geometric mean models were calculated by comparing model outputs with the true labels. Logistic regression evaluation statistics were calculated using the scores found by 10-fold cross-validation of O. sativa training data and validated lncRNA sequences.
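The MCC column reports the Matthews correlation coefficient, which summarizes all four confusion-matrix cells in one value between -1 and 1 and is less flattering than accuracy when classes are imbalanced. A minimal sketch of the standard formula follows. The function name and the example counts are hypothetical, and returning 0.0 for an undefined denominator is a common convention assumed here.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when any marginal total is zero; 0.0 is a common convention.
        return 0.0
    return numerator / denominator


# Hypothetical counts: 2 true positives, 15 true negatives,
# 1 false positive, 2 false negatives.
mcc = matthews_corrcoef_from_counts(tp=2, tn=15, fp=1, fn=2)
```

With these toy counts accuracy is 0.85 but the MCC is only about 0.49, which mirrors the pattern in the table where accuracy exceeds 0.94 while MCC ranges from 0.661 to 0.822.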
Fig. 2 Counts of predicted lncRNAs in A. thaliana, E. salsugineum and O. sativa from the gradient boosting stacking generalizer method and GreeNC database. Counts of predicted lncRNAs in this work from all three species were also compared to predictions recorded in GreeNC. Overlapping predictions of the two methods are represented as shaded bars. The percentages above each bar represent the percent of the total predictions by each method that are shared.
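The percentages described in this caption are computed per method, so the same set of shared predictions yields a different percentage for each bar. A minimal sketch, assuming predictions are compared by transcript identifier; the function name and the identifiers are hypothetical placeholders, not data from the paper.

```python
def shared_percentages(predicted_a, predicted_b):
    """Return the shared count and the percent of each set that is shared."""
    a, b = set(predicted_a), set(predicted_b)
    shared = a & b
    pct_a = 100.0 * len(shared) / len(a) if a else 0.0
    pct_b = 100.0 * len(shared) / len(b) if b else 0.0
    return len(shared), pct_a, pct_b


# Hypothetical transcript identifiers for two prediction methods.
this_work = {"tx1", "tx2", "tx3", "tx4"}
database = {"tx3", "tx4", "tx5", "tx6", "tx7", "tx8", "tx9", "tx10"}
n_shared, pct_this_work, pct_database = shared_percentages(this_work, database)
```

Here two shared transcripts make up half of the smaller prediction set but only a quarter of the larger one.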
Number of transcripts in annotation categories of top ranking lncRNAs in the A. thaliana transcriptome
| Annotation category | Number of annotations |
|---|---|
| Natural antisense lncRNA | 64 |
| Pseudogene | 75 |
| Transposable element gene | 10 |
| Transposase | 46 |
| miRNA primary transcript | 4 |
| Hypothetical protein | 5 |
| Protein | 8 |
| Other | 8 |