Philip H Williams, Rod Eyles, Georg Weiller.
Abstract
MicroRNAs (miRNAs) are non-protein-coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other non-protein-coding sequences. MirTools, mirDeep2, and miRanalyzer require a "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor on a cross-section of species so that it accurately predicts miRNAs outside the training set. We wanted a system that did not require a read count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It requires only that the corresponding genome or transcriptome be available within a sequence window that includes the precursor candidate, so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA* duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.
Year: 2012 PMID: 23209882 PMCID: PMC3503367 DOI: 10.1155/2012/652979
Source DB: PubMed Journal: J Nucleic Acids ISSN: 2090-0201
Plant species from which miRBase miRNAs were used.
| Taxonomic group | Species |
|---|---|
| Brassicaceae | |
| Caricaceae | |
| Embryophyta | |
| Lycopodiophyta | |
| Euphorbiaceae | |
| Fabaceae | |
| Rutaceae | |
| Salicaceae | |
| Solanaceae | |
| Vitaceae | |
| Poaceae | |
| Panicoideae | |
| Pooideae | |
Figure 1. Data processing flow chart for collection of controls, training, and statistical validation. Known miRNAs from miRBase are used as positive controls. Negative controls are randomly picked short segments from ESTs based on the quantity and length distribution of known miRNAs from each plant species in the test set. All controls are aligned to their respective genome. Alignment location allows collection of attributes. For positive controls, the alignment position also allows the location of the miRNA in relation to upstream or downstream of miRNA* to be determined within the precursor. Only attributes from the correct location are valid for training positive controls. Upstream or downstream attributes are equally valid for negative controls, therefore only one is selected at random. Positive controls were aligned using needleall to determine the similarity between all miRNAs. Negative controls were also aligned. The similarity values from the alignment were used to determine if other highly similar sequences should also be excluded when the one in the leave-one-out set was excluded from training. Each sequence in a leave-one-out set was tested for correct classification by the model just trained using controls in the inclusion set. Counts of correct and incorrect classification were used to calculate sensitivity and specificity.
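The last steps of this flow chart can be illustrated with a minimal Python sketch. It assumes a precomputed pairwise similarity matrix (the caption obtains similarities with needleall) and a cutoff `threshold`; the function names, the matrix layout, and the cutoff value are hypothetical and are not taken from the paper.

```python
def inclusion_set(held_out, n_sequences, similarity, threshold):
    """Return indices of controls kept for training when `held_out` is left out.

    `similarity[i][j]` is a pairwise alignment similarity score.
    `threshold` is a hypothetical cutoff: any sequence at least this
    similar to the held-out one is excluded from training as well.
    """
    return [
        i for i in range(n_sequences)
        if i != held_out and similarity[held_out][i] < threshold
    ]


def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) as fractions from raw counts.

    tp/fn: positive controls classified correctly/incorrectly.
    tn/fp: negative controls classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

In use, each held-out sequence would be classified by a model trained only on `inclusion_set(...)`, and the resulting correct and incorrect counts would be passed to `sensitivity_specificity`.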
Figure 2. (a) Example of an upstream miRNA with downstream miRNA*. (b) Example of a downstream miRNA with upstream miRNA*. The mature miRNA (in red) may exist within the precursor upstream of the miRNA* (a) or downstream (b). Determining this location is critical for collecting the appropriate attribute set.
Figure 3. A comparison between precursors from miRBase and the corresponding OPR predicted from the EST data. miRNA lja-miR167 MIMAT0010087 (in red) is found in precursor MI0010580. No genome is listed in miRBase for this miRNA, and it does not align to any chromosome for that species. It does, however, align to EST [GenBank: BW598483]. The EST lja-miR167 is correctly classified as a miRNA with a predicted precursor highly similar to the one in miRBase.
Base set of attributes.
| Attribute | Description of the attribute in relation to the control or candidate sequence |
|---|---|
| chromLen/position | The ratio of the length of the chromosome over the position on that chromosome |
| ShannonEntropyNorm | Shannon entropy normalized to the sequence length |
| G% | Percentage of G base composition |
| C% | Percentage of C base composition |
| T% | Percentage of T base composition |
| A% | Percentage of A base composition |
| DuplexEnergy | The duplex energy of the miRNA:miRNA* duplex |
| DuplexEnergyNorm | The duplex energy normalized to the length of the duplex structure |
| MaxMismatch | Maximum number of mismatches in the duplex structure based on both sides of the structure |
| minMatchPercent | Minimum % match based on length of the duplex structure both sides of the structure |
| DeltaG | Minimum free energy for the stem loop |
| DeltaGnorm | Minimum free energy normalized to the length of the stem loop |
| longestDotSet | Longest run of mismatches in the stem loop |
| longestBracketSet | Longest run of matches in the stem loop |
| loopCountNorm | Number of loop heads normalized to the length of the stem loop |
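Several of the base attributes can be derived directly from a sequence and a dot-bracket secondary structure string. The sketch below is illustrative only and rests on assumptions not stated in the table: normalized Shannon entropy is taken as entropy in bits divided by sequence length, a "match" is any bracket character, a "mismatch" is a dot, and a loop head is a hairpin of the form `(...)`. Energy attributes are omitted because they require an RNA folding tool.

```python
import math
import re
from collections import Counter


def base_percentages(seq):
    """Percentage of G, C, T, and A in the sequence (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq)
    return {base + "%": 100.0 * counts[base] / len(seq) for base in "GCTA"}


def shannon_entropy_norm(seq):
    """Shannon entropy in bits, divided by sequence length (assumed definition)."""
    n = len(seq)
    counts = Counter(seq.upper())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / n


def longest_run(structure, chars):
    """Length of the longest run made up only of the characters in `chars`."""
    runs = re.findall("[" + re.escape(chars) + "]+", structure)
    return max((len(run) for run in runs), default=0)


def loop_count_norm(structure):
    """Number of hairpin loop heads divided by the structure length."""
    return len(re.findall(r"\(\.+\)", structure)) / len(structure)
```

For example, `longest_run(structure, ".")` would play the role of longestDotSet and `longest_run(structure, "()")` the role of longestBracketSet under these assumptions.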
Extended attribute set from combinations of base attributes.
| Attribute | Description of the attribute in relation to the control or candidate sequence |
|---|---|
| G + T:= G% + T% | Sum of G% and T% |
| G/T:= G%/T% | Ratio from G% to T% |
| G + C:= G% + C% | Sum of G% and C% |
| G/C:= G%/C% | Ratio from G% to C% |
| A + C:= A% + C% | Sum of A% and C% |
| A/C:= A%/C% | Ratio from A% to C% |
| T + A:= T% + A% | Sum of T% and A% |
| T/A:= T%/A% | Ratio from T% to A% |
| G%/ShannonEntropyNorm := G%/ShannonEntropyNorm | Ratio of G% over normalized Shannon entropy |
| C%/ShannonEntropyNorm := C%/ShannonEntropyNorm | Ratio of C% over normalized Shannon entropy |
| T%/ShannonEntropyNorm := T%/ShannonEntropyNorm | Ratio of T% over normalized Shannon entropy |
| A%/ShannonEntropyNorm := A%/ ShannonEntropyNorm | Ratio of A% over normalized Shannon entropy |
| NormEnergyRatio := DeltaGnorm/DuplexEnergyNorm | Ratio of the normalized DeltaG of the stem loop to the normalized miRNA:miRNA* duplex energy |
| longestBracket/longestDot := longestBracketSet / longestDotSet | Ratio of longest match over the longest mismatch normalized counts |
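The extended attributes are simple sums and ratios of the base attributes, so they can be computed from a dictionary of base values. The following is a minimal sketch; the function name, the dictionary layout, and the choice to return `None` when a denominator is zero are assumptions made for illustration.

```python
def extended_attributes(base):
    """Derive the extended attribute set from a dict of base attributes."""

    def ratio(numerator, denominator):
        # Hypothetical convention: undefined ratios are reported as None.
        return None if denominator == 0 else numerator / denominator

    ext = {}
    # Pairwise sums and ratios of base composition percentages.
    for a, b in [("G", "T"), ("G", "C"), ("A", "C"), ("T", "A")]:
        ext[a + " + " + b] = base[a + "%"] + base[b + "%"]
        ext[a + "/" + b] = ratio(base[a + "%"], base[b + "%"])
    # Composition relative to normalized Shannon entropy.
    for b in "GCTA":
        ext[b + "%/ShannonEntropyNorm"] = ratio(base[b + "%"], base["ShannonEntropyNorm"])
    # Stem-loop energy relative to duplex energy, both length-normalized.
    ext["NormEnergyRatio"] = ratio(base["DeltaGnorm"], base["DuplexEnergyNorm"])
    ext["longestBracket/longestDot"] = ratio(base["longestBracketSet"], base["longestDotSet"])
    return ext
```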
Example of attribute usage from one representative training run.
| Usage | Attribute |
|---|---|
| 100% | G% |
| 100% | C% |
| 100% | T% |
| 100% | DuplexEnergy |
| 100% | minMatchPercent |
| 100% | DeltaGnorm |
| 100% | G + T |
| 100% | G + C |
| 98% | DuplexEnergyNorm |
| 86% | NormEnergyRatio |
| 85% | MaxMismatch |
| 82% | ShannonEntropyNorm |
| 74% | G/T |
| 51% | A% |
| 51% | A + C |
| 28% | chromLen/position |
Attributes with an average decline of 1% or greater when excluded.
| Attribute | Average percentage decline in training accuracy when the attribute is removed |
|---|---|
| DuplexEnergy | 53% |
| T% + A% | 20% |
| DeltaGnorm | 14% |
| longestBracketSet | 10% |
| minMatchPercent | 7% |
| G% + C% | 5% |
| loopCountNorm | 4% |
| MaxMismatch | 4% |
| DuplexEnergyNorm | 1% |
| DeltaG | 1% |
| longestBracket/longestDot | 1% |
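An exclusion experiment of this kind can be sketched as a loop that retrains without each attribute in turn. The code below is only an outline under stated assumptions: `train_and_score` is a hypothetical callable that trains a model on the given attribute list and returns its training accuracy, and the decline is expressed relative to the baseline accuracy, which may differ from how the percentages in the table were calculated.

```python
def accuracy_decline(attributes, train_and_score):
    """Percentage decline in accuracy when each attribute is removed.

    `train_and_score(attribute_list)` is a hypothetical callable that
    returns an accuracy in [0, 1] for a model trained on those attributes.
    """
    baseline = train_and_score(attributes)
    declines = {}
    for attr in attributes:
        reduced = [a for a in attributes if a != attr]
        # Assumed definition: decline relative to the full-attribute baseline.
        declines[attr] = 100.0 * (baseline - train_and_score(reduced)) / baseline
    return declines
```

Averaging the returned values over repeated training runs would give a figure comparable in spirit to the table above.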
Results from exclusion of each of the 13 taxonomic groups.
| Taxonomic grouping | Error count | Total count of miRNAs | % correctly classified | % of full set excluded | Notes |
|---|---|---|---|---|---|
| Embryophyta | 12 | 190 | 94 | 9.16 | |
| Lycopodiophyta | 0 | 55 | 100 | 2.65 | |
| Brassicaceae | 0 | 440 | 100 | 21.22 | Dicot, two |
| Caricaceae | 0 | 1 | 100 | 0.05 | Dicot, papaya |
| Euphorbiaceae | 0 | 7 | 100 | 0.34 | Dicot, castor oil plant |
| Fabaceae | 0 | 560 | 100 | 27.00 | Dicot, three legumes |
| Salicaceae | 16 | 73 | 78 | 3.52 | Dicot, poplar tree |
| Solanaceae | 1 | 14 | 93 | 0.68 | Dicot, tomato plant |
| Vitaceae | 9 | 89 | 94 | 4.29 | Dicot, common grape |
| Rutaceae | 0 | 9 | 100 | 0.43 | Dicot, two citrus trees |
| Panicoideae | 8 | 166 | 95 | 8.00 | Monocot, |
| Poaceae | 0 | 404 | 100 | 19.48 | Monocot, rice and sorghum |
| Pooideae | 13 | 66 | 80 | 3.18 | Monocot, |
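The two percentage columns follow from the error count, the group's miRNA count, and the size of the full set. The sketch below assumes that "% correctly classified" is the share of the excluded group's miRNAs that were not errors and that the full set is the sum of the count column (2074 for the 13 groups above); the function name is hypothetical.

```python
def exclusion_summary(errors, group_count, full_set_count):
    """Return (% correctly classified, % of full set excluded) for one group."""
    pct_correct = 100.0 * (group_count - errors) / group_count
    pct_excluded = 100.0 * group_count / full_set_count
    return round(pct_correct), round(pct_excluded, 2)


# Salicaceae row: 16 errors among 73 miRNAs, full set of 2074.
print(exclusion_summary(16, 73, 2074))
# Pooideae row: 13 errors among 66 miRNAs.
print(exclusion_summary(13, 66, 2074))
```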
Results from exclusion of each of the four taxonomic groups.
| Taxonomic grouping | Error count | Total count of miRNAs | % correctly classified | % of full set excluded | Notes |
|---|---|---|---|---|---|
| Embryophyta | 13 | 190 | 93 | 9.16 | |
| Lycopodiophyta | 0 | 55 | 100 | 2.65 | |
| Monocotyledons | 0 | 1193 | 100 | 57.52 | Four species |
| Dicotyledons | 0 | 636 | 100 | 30.67 | Twelve species |
Results from exclusion of each of the 3 taxonomic groups.
| Taxonomic grouping | Error count | Total count of miRNAs | % correctly classified | % of full set excluded | Notes |
|---|---|---|---|---|---|
| Primitive | 0 | 389 | 100 | 11.81 | |
| Monocotyledons | 0 | 1775 | 100 | 30.67 | Four species |
| Dicotyledons | 0 | 946 | 100 | 57.52 | Twelve species |
(a)
| Value | Statistic |
|---|---|
| 3284 | Total count of stem loops |
| 938 | Max stem loop length (nt) |
| 53 | Min stem loop length (nt) |
| 153 | Average stem loop length (nt) |
| 132 | Median stem loop length (nt) |
(b)
| | Count > 300 nt | Count < 300 nt | Count < 350 nt |
|---|---|---|---|
| Counts | 135 | 3146 | 3194 |
| % | 4.11% | 95.80% | 97.26% |
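Summary statistics of this form can be reproduced from a list of stem-loop lengths with the Python standard library. This is a sketch with a hypothetical function name and toy input; it does not use the actual stem-loop data behind the tables.

```python
import statistics


def stem_loop_summary(lengths, cutoffs=(300, 350)):
    """Summarize stem-loop lengths (nt) and the share below each cutoff."""
    total = len(lengths)
    summary = {
        "total": total,
        "max": max(lengths),
        "min": min(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
    for cutoff in cutoffs:
        below = sum(1 for n in lengths if n < cutoff)
        summary["count < %d" % cutoff] = below
        summary["% < " + str(cutoff)] = round(100.0 * below / total, 2)
    return summary
```

As a check on the reported figures, 3146 of 3284 stem loops corresponds to 95.80% and 3194 of 3284 to 97.26%.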