Xiaodan Zhang, Xiaohu Zhou, Midi Wan, Jinxiang Xuan, Xiu Jin, Shaowen Li.
Abstract
There is evidence that non-coding RNAs (ncRNAs) play significant roles in the regulation of nutrient homeostasis, development, and stress responses in plants. Accurate identification of ncRNAs is the first step in determining their function. While a number of machine learning tools have been developed for ncRNA identification, no dedicated tool has been developed for ncRNA identification in plants. Here, an automated machine learning tool, PINC, is presented to identify ncRNAs in plants from RNA sequences. First, we extracted 91 features from each sequence. Second, we combined the F-test and a variance threshold for feature selection, retaining 10 features. The AutoGluon framework was then used to train models for robust identification of ncRNAs on datasets constructed from four plant species. Last, these steps were combined into a tool, PINC, for the identification of plant ncRNAs. PINC was validated on nine independent test sets, where its accuracy ranged from 92.74% to 96.42%. Compared with CPC2, CPAT, CPPred, and CNIT, PINC outperformed the other tools on at least five of the eight evaluation indicators. PINC is expected to contribute to identifying and annotating novel ncRNAs in plants.
Keywords: AutoGluon; ncRNA identification; plant; tool
Year: 2022 PMID: 36233123 PMCID: PMC9570155 DOI: 10.3390/ijms231911825
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. (A) Graph showing the accuracy of 100 experiments; (B) graph showing the average accuracy of every 5th experiment out of 100 experiments.
Comparing the performance of different feature selection methods.
| Method | SE (%) | SPC (%) | ACC (%) | MCC (%) | F1 (%) |
|---|---|---|---|---|---|
| F-test | 99.49 | 90.18 | 94.77 | 89.95 | 94.93 |
| VT | 99.49 | 89.24 | 94.29 | 89.08 | 94.49 |
| VT-F | | 90.7 | | | |
| RF | 90.35 | | 93.27 | 86.76 | 93.39 |
| RF-AutoGluon | 99.45 | 90.14 | 94.72 | 89.87 | 94.89 |
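The VT-F row refers to combining a variance threshold (VT) with an F-test. As a rough illustration of how such a two-stage filter can work, the sketch below first drops near-constant features and then ranks the remainder by a one-way ANOVA F statistic. The threshold value, the order of the two stages, and every function name here are assumptions made for illustration; the source does not specify them, and this is not the PINC implementation.

```python
from statistics import mean, pvariance


def variance_filter(columns, threshold=0.0):
    """Indices of feature columns whose population variance exceeds threshold."""
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    means = {y: mean(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ss_within == 0:
        # Perfect separation (or no signal at all) would divide by zero.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_features(rows, labels, k=10, var_threshold=0.0):
    """Variance filter first, then keep the k features with the largest F."""
    columns = list(zip(*rows))  # one tuple per feature
    kept = variance_filter(columns, var_threshold)
    ranked = sorted(kept, key=lambda j: f_score(columns[j], labels), reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): feature 0 is constant, feature 1 separates the
# two classes, feature 2 has the same mean in both classes.
rows = [[1.0, 5.0, 0.1], [1.0, 5.1, 0.9], [1.0, 9.0, 0.2], [1.0, 9.2, 0.8]]
labels = [0, 0, 1, 1]
print(select_features(rows, labels, k=1))  # [1]
```

With k=10 on a 91-column feature matrix, the same call would return ten column indices, matching the number of features the abstract reports as retained.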
Figure 2. Differential distribution of ten features in coding RNAs and ncRNAs.
Figure 3. Correlation analysis chart of 10 features selected for the classification task.
Performance comparisons among five automated machine learning frameworks and three conventional machine learning models.
| Model | ACC (%) | F1 (%) | AUC (%) | MCC (%) | NPV (%) | PPV (%) | SE (%) | SPC (%) |
|---|---|---|---|---|---|---|---|---|
| AutoGluon | | | 95.25 | | | 91.55 | | 90.70 |
| Naive Bayes | 86.97 | 87.96 | 86.93 | 74.83 | 79.22 | 94.65 | 82.16 | 93.63 |
| SVM | 53.14 | 13.67 | 53.37 | 16.93 | 99.33 | 0.07 | 91.65 | 51.51 |
| RFC | 92.10 | 92.26 | 92.09 | 84.26 | 90.45 | 93.73 | 90.86 | 93.47 |
| H2O | 92.98 | 93.38 | | 86.60 | 86.98 | 98.98 | 88.38 | 98.84 |
| TPOT | 86.14 | 86.19 | 86.15 | 72.29 | 86.18 | 86.10 | 86.28 | 86.00 |
| Autokeras | 93.70 | 94.06 | 94.57 | 87.95 | 88.10 | | 89.39 | |
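The threshold-based indicators in the table above (all except AUC) can be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions and reports percentages, as the tables do. Which class is treated as positive is not stated in the source, so the argument names are generic, and the function name is a hypothetical one chosen for this example.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification indicators, as percentages.

    Assumes every denominator is non-zero, i.e. both classes are present
    and the classifier predicts each class at least once.
    """
    se = tp / (tp + fn)                      # sensitivity (recall)
    spc = tn / (tn + fp)                     # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    ppv = tp / (tp + fp)                     # positive predictive value
    npv = tn / (tn + fn)                     # negative predictive value
    f1 = 2 * ppv * se / (ppv + se)           # harmonic mean of PPV and SE
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                        # Matthews correlation coefficient
    raw = {"SE": se, "SPC": spc, "ACC": acc, "PPV": ppv,
           "NPV": npv, "F1": f1, "MCC": mcc}
    return {name: 100 * value for name, value in raw.items()}


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

AUC is omitted because it depends on the ranking of predicted scores rather than on a single confusion matrix.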
Figure 4. Comparing the identification accuracy of five tools across nine independent test sets.
Performance indicators of the five tools on the nine plant test sets.
| Species | Tool | SE (%) | SPC (%) | ACC (%) | F1 (%) | PPV (%) | NPV (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|---|---|---|
| | PINC | | 92.46 | | | | | | 95.61 |
| | CPC2 | 76.01 | 92.91 | 84.45 | 83.04 | 91.50 | 79.42 | 69.92 | |
| | CPAT | 89.27 | 88.75 | 89.01 | 89.05 | 88.84 | 89.18 | 78.02 | 96.26 |
| | CNIT | 65.65 | | 80.10 | 76.81 | 92.54 | 73.23 | 62.99 | 94.36 |
| | CPPred | 71.24 | 87.70 | 79.46 | 77.65 | 85.32 | 75.24 | 59.75 | 89.72 |
| | PINC | | 86.84 | | | 88.29 | | | 92.72 |
| | CPC2 | 85.25 | 90.62 | 87.94 | 87.60 | | 86.00 | 75.99 | |
| | CPAT | 95.03 | 84.02 | 89.53 | 90.07 | 85.61 | 94.42 | 79.54 | 93.65 |
| | CNIT | 63.23 | | 76.98 | 73.31 | 87.21 | 71.16 | 56.13 | 90.67 |
| | CPPred | 80.78 | 86.59 | 83.89 | 82.36 | 84.00 | 83.80 | 67.59 | 91.66 |
| | PINC | | 87.00 | | | | | | 92.90 |
| | CPC2 | 70.14 | 90.96 | 80.56 | 78.30 | 88.58 | 75.30 | 62.49 | |
| | CPAT | 87.24 | 82.50 | 84.84 | 85.03 | 82.94 | 86.90 | 69.79 | 92.00 |
| | CNIT | 51.95 | | 72.03 | 65.08 | 87.09 | 65.59 | 48.26 | 89.79 |
| | CPPred | 64.39 | 84.06 | 74.23 | 71.42 | 80.16 | 70.24 | 49.42 | 84.12 |
| | PINC | | | | | | | | |
| | CPC2 | 87.82 | 85.14 | 86.48 | 86.66 | 85.53 | 87.48 | 72.99 | 92.15 |
| | CPAT | 93.55 | 81.73 | 87.64 | 88.33 | 83.66 | 92.68 | 75.81 | 91.13 |
| | CNIT | 62.73 | 86.18 | 74.46 | 71.06 | 81.94 | 69.82 | 50.33 | 91.30 |
| | CPPred | 87.5 | 80.08 | 84.79 | 85.19 | 83.00 | 86.78 | 69.68 | 88.97 |
| | PINC | | 87.22 | | | 88.61 | | | 93.22 |
| | CPC2 | 90.12 | | 89.62 | 90.6 | | 87.83 | 79.02 | |
| | CPAT | 71.69 | 88.82 | 80.25 | 78.41 | 86.54 | 75.79 | 61.42 | 91.84 |
| | CNIT | 65.08 | 88.28 | 76.66 | 73.63 | 84.77 | 71.6 | 54.85 | 90.14 |
| | CPPred | 76.44 | 86.1 | 81.27 | 80.33 | 84.64 | 78.48 | 62.84 | 89.24 |
| | PINC | | 91.94 | | | 92.69 | | | 95.38 |
| | CPC2 | 82.69 | | 88.28 | 87.57 | | 84.45 | 77.03 | |
| | CPAT | 82.9 | 91.39 | 87.14 | 86.57 | 90.59 | 84.24 | 74.56 | 95.10 |
| | CNIT | 55.21 | 92.38 | 73.79 | 67.81 | 87.88 | 67.34 | 51.27 | 92.24 |
| | CPPred | 84.89 | 86.59 | 85.74 | 85.62 | 86.36 | 85.14 | 71.49 | 91.56 |
| | PINC | | 84.53 | | | | | | |
| | CPC2 | 67.23 | 86.99 | 77.11 | 74.60 | 83.79 | 72.63 | 55.31 | 90.61 |
| | CPAT | 86.69 | 78.61 | 82.65 | 83.31 | 80.18 | 85.54 | 65.51 | 89.47 |
| | CNIT | 58.76 | | 73.62 | 69.02 | 83.63 | 68.20 | 49.49 | 88.12 |
| | CPPred | 60.64 | 81.75 | 71.20 | 67.80 | 76.87 | 67.50 | 43.38 | 81.24 |
| | PINC | | 87.69 | | | | | | 93.79 |
| | CPC2 | 94.38 | 87.32 | 90.85 | 91.16 | 88.16 | 93.95 | 81.91 | |
| | CPAT | 86.65 | | 87.68 | 87.55 | 88.46 | 86.93 | 75.38 | 95.71 |
| | CNIT | 75.04 | 85.34 | 80.19 | 79.10 | 83.63 | 77.40 | 60.71 | 92.89 |
| | CPPred | 91.81 | 85.04 | 88.42 | 88.80 | 85.98 | 91.21 | 77.02 | 94.28 |
| | PINC | | 90.38 | | | | | | 95.04 |
| | CPC2 | 90.81 | 90.88 | 90.85 | 90.84 | 90.87 | 90.82 | 81.70 | |
| | CPAT | 76.52 | | 83.96 | 82.64 | 89.82 | 79.63 | 68.67 | 95.07 |
| | CNIT | 65.24 | 90.10 | 77.69 | 74.49 | 86.78 | 72.24 | 57.15 | 92.50 |
| | CPPred | 84.83 | 87.92 | 86.38 | 86.16 | 87.54 | 85.29 | 72.80 | 93.05 |
Figure 5. ROC curves for 5 tools on 9 plants.
Figure 6. The PR curves obtained by PINC and four existing tools.
Figure 7. Overall workflow: (A) dataset construction: a dataset was constructed from four plant species for training and validation, and nine independent test sets were constructed for testing; (B) feature extraction: features were extracted from the original sequences, and redundant features were filtered out using feature selection methods; (C) model construction: a stacking strategy was used to integrate multiple models.
Training set data for the model.
| Species | Noncoding (total) | Noncoding (used) | Coding (total) | Coding (used) |
|---|---|---|---|---|
| | 45,910 | 2000 | 27,416 | 2000 |
| | 8599 | 2000 | 71,358 | 2000 |
| | 11,338 | 2000 | 42,189 | 2000 |
| | 4301 | 2000 | 55,564 | 2000 |
| Total | 70,148 | 8000 | 196,527 | 8000 |
Detailed description of the training set data.
| Category | Type | Size |
|---|---|---|
| Non-coding RNAs | Long ncRNAs | 1800 |
| | Small ncRNAs | 200 |
| Coding RNAs | mRNAs | 2000 |
| Overall | | 4000 |
Figure 8. Distribution of positive and negative sample lengths in the benchmark dataset.
Plant dataset for testing.
| Species | Coding | Noncoding | Total |
|---|---|---|---|
| | 2099 | 2099 | 4198 |
| | 5622 | 5622 | 11,244 |
| | 4682 | 4682 | 9364 |
| | 2808 | 2808 | 5616 |
| | 2059 | 2063 | 4122 |
| | 1708 | 1708 | 3416 |
| | 8282 | 8282 | 16,564 |
| | 8657 | 8657 | 17,314 |
| | 7406 | 7406 | 14,812 |
All features considered in this paper.
| Features | Description | Source |
|---|---|---|
| k-mer frequency | Frequencies of all 1–3 nt k-mers, 84 features in total: 1 nt = 4 features; 2 nt = 16 features; 3 nt = 64 features | PINC |
| Score | Values >800 are likely to be protein-coding; values >1000 must be protein-coding | txCdsPredict |
| cdsStarts | Zero-based nucleotide position of the CDS start in the transcript | txCdsPredict |
| cdsStop | Nucleotide position of the CDS end | txCdsPredict |
| cdsSizes | cdsStop − cdsStarts | txCdsPredict |
| cdsPercent | (cdsStop − cdsStarts)/total nucleotide length of the sequence | txCdsPredict |
| Sequence length | Total nucleotide length of the sequence | PINC |
| GC content | | PINC |
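The features attributed to PINC in the table above (k-mer frequencies, sequence length, GC content) can be computed directly from a sequence string. The sketch below is one possible way to do so with the Python standard library. Several details are assumptions, since the source does not state them: k-mer counts are normalized by the number of sliding windows, U is mapped to T, windows containing ambiguous bases are skipped, and GC content is taken as the usual (G + C)/length. The function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, max_k=3):
    """Frequencies of all 1..max_k-mers (4 + 16 + 64 = 84 values for max_k=3)."""
    seq = seq.upper().replace("U", "T")  # assumption: treat RNA as DNA letters
    feats = {}
    for k in range(1, max_k + 1):
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = max(len(seq) - k + 1, 0)
        for i in range(windows):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with ambiguous bases such as N
                counts[kmer] += 1
        for kmer in kmers:
            feats[kmer] = counts[kmer] / windows if windows else 0.0
    return feats


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sequence_features(seq):
    """k-mer frequencies plus sequence length and GC content."""
    feats = kmer_frequencies(seq)
    feats["length"] = len(seq)
    feats["gc_content"] = gc_content(seq)
    return feats


features = sequence_features("ACGU")
print(len(features))           # 86: 84 k-mer frequencies + length + GC content
print(features["gc_content"])  # 0.5
```

The five remaining features in the table come from txCdsPredict output and are not reproduced here.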