Thao T Tran, Fengfeng Zhou, Sarah Marshburn, Mark Stead, Sidney R Kushner, Ying Xu.
Abstract
MOTIVATION: The computational identification of non-coding RNA (ncRNA) genes represents one of the most important and challenging problems in computational biology. Existing methods for ncRNA gene prediction rely mostly on homology information, thus limiting their applications to ncRNA genes with known homologues.
Year: 2009 PMID: 19744996 PMCID: PMC2773258 DOI: 10.1093/bioinformatics/btp537
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Ensemble statistics. Boxplots for the (A) overall compactness and (B) within-cluster sum of squares versus sequence lengths for ncRNAs (Positive936) and their decoys (Dishuffle936). The outliers indicated by the tick marks are values more than two times the inter-quartile range. In general, ncRNAs tended to have fewer clusters that were denser (lower compactness measure) than their decoys, and their within-cluster sums of squares were generally smaller than those of their decoys.
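To make the within-cluster sum of squares in Fig. 1B concrete, the following is a minimal sketch of how such a quantity can be computed for an already clustered set of points. It assumes numeric vectors and squared Euclidean distance to each cluster centroid; the source does not state here how sampled structures are represented or which distance the rnacluster features use, so the function names and the toy data are illustrative only.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed metric, not taken from the source)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_ss(clusters):
    """Sum of squared distances from each member to its own cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sq_dist(p, c) for p in cluster)
    return total


def between_cluster_ss(clusters):
    """Size-weighted squared distances from cluster centroids to the grand centroid."""
    grand = centroid([p for cluster in clusters for p in cluster])
    return sum(len(cl) * sq_dist(centroid(cl), grand) for cl in clusters)


# Toy data: two tight clusters versus two looser clusters with the same centers.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 6)], [(10, 0), (10, 6)]]

print(within_cluster_ss(tight), within_cluster_ss(loose))
```

In this toy case the tighter clustering gives the smaller within-cluster value, which is the direction the caption reports for ncRNAs relative to their decoys.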
Fig. 2. Structural statistics. Boxplots for the (A) hairpin-loop count and (B) total internal-structure count (internal-loop and bulges) versus lengths for ncRNAs (Positive936) and their decoys (Dishuffle936). The outliers indicated by the tick marks are values more than two times the inter-quartile range. In general, ncRNAs tended to have more loop regions and fewer internal-loops on average than their decoys.
The mean and variance for each feature's AUROC value
| Features | Mean (AUROC) | Var (AUROC) |
|---|---|---|
| rnacluster_wss_point | 0.7316 | 5.09E-05 |
| rnacluster_maxcompactness | 0.6437 | 7.14E-05 |
| entropy_entropy | 0.6329 | 5.54E-05 |
| structuralstatistics_stem_ave | 0.6325 | 4.51E-05 |
| diversity_ensemble_diversity | 0.6263 | 8.10E-05 |
| rnacluster_overallcompactness | 0.6249 | 8.07E-05 |
| rnacluster_avecompactness | 0.6059 | 3.18E-05 |
| rnacluster_num_hifreq_bp_ensemble | 0.6041 | 6.27E-05 |
| rnacluster_bss_point | 0.6039 | 7.73E-05 |
| rnacluster_ave_bpdist_mfe_ensemble | 0.6011 | 7.29E-05 |
| rnacluster_compactnesslargest | 0.5988 | 8.54E-05 |
| structuralstatistics_stem_count | 0.5960 | 6.55E-05 |
| rnacluster_ave_num_hifreq_bp_percluster | 0.5901 | 2.00E-05 |
| structuralstatistics_mfe | 0.5900 | 7.04E-06 |
| diversity_free_energy_thermo_ensemble | 0.5856 | 5.94E-06 |
| structuralstatistics_total_internal_count | 0.5848 | 7.69E-05 |
| rnacluster_bss | 0.5747 | 7.59E-05 |
| rnacluster_nclusters | 0.5630 | 3.99E-05 |
| rnacluster_wss | 0.5601 | 6.19E-05 |
| structuralstatistics_total_internal_nt | 0.5551 | 7.77E-05 |
| structuralstatistics_loop_ave | 0.5402 | 8.60E-05 |
| rnacluster_nlargest | 0.5292 | 7.93E-05 |
| rnacluster_mincompactness | 0.5237 | 7.99E-06 |
| structuralstatistics_multiloop_ave | 0.5169 | 5.11E-05 |
| structuralstatistics_loop_count | 0.5133 | 5.96E-05 |
The normalized feature values from ‘Positive936’ were compared to 1000 runs of its ‘Dishuffle936’ to assess each feature's ability to discriminate between ncRNAs and the corresponding di-shuffled set. The performance shown is organism-independent to allow for an unbiased comparison among the features. Ten features have an average AUROC value above 0.6 that is highly stable across 1000 runs, each of which uses a different negative set.
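The per-feature evaluation described above can be sketched as follows: compute the AUROC of one feature against a fresh negative set, repeat over many runs, and report the mean and variance. This is a minimal illustration, not the authors' pipeline. The feature values are synthetic Gaussian draws, `fresh_decoys` is a hypothetical stand-in for generating a new di-shuffled set, and the sketch assumes that larger values indicate ncRNAs (a feature that tends to be smaller for ncRNAs would need its sign flipped first).

```python
import random
import statistics


def auroc(pos, neg):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_over_runs(pos, make_negatives, runs):
    """Mean and sample variance of AUROC over runs, each with a new negative set."""
    values = [auroc(pos, make_negatives()) for _ in range(runs)]
    return statistics.mean(values), statistics.variance(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for one normalized feature measured on the positive set.
positives = [rng.gauss(0.6, 0.2) for _ in range(200)]


def fresh_decoys():
    # Hypothetical: a real run would recompute the feature on newly shuffled decoys.
    return [rng.gauss(0.5, 0.2) for _ in range(200)]


mean_auc, var_auc = auroc_over_runs(positives, fresh_decoys, runs=50)
print(f"mean AUROC = {mean_auc:.4f}, variance = {var_auc:.2e}")
```

A small variance across runs, as in the table, indicates that a feature's discriminative power does not depend on which particular shuffled negatives were drawn.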
Fig. 3. A schematic of the classifier architecture used for genome-wide prediction. The results of each NN-based classifier were then post-processed and combined into a final NN-based classifier to make the final prediction. The outputs of the length-specific NN-based classifiers and the voting classifier were labeled by score r_i for 0 ≤ i ≤ N and score s, respectively.
AUROC values of our predictions for E. coli and S. solfataricus using their optimal three window sizes for both the direct and reverse complement strands
| Organism | Strand | AUROC | | |
|---|---|---|---|---|
| | + | 0.7182 | 0.6638 | 0.7557 |
| | − | 0.6457 | 0.7275 | 0.7628 |
| | + | 0.6235 | 0.7614 | 0.7502 |
| | − | 0.5149 | 0.8224 | 0.7214 |
Comparison of prediction accuracies by different programs for E. coli
| Program | No. of predictions | Sensitivity | PPV |
|---|---|---|---|
| Carter | 563 | 0.3441 | 0.0568 |
| Chen | 227 | 0.2903 | 0.1189 |
| Rivas | 275 | 0.4086 | 0.1382 |
| Saetrom | 306 | 0.1183 | 0.0359 |
| Wang | 420 | 0.0753 | 0.0167 |
| Tran | 601 | 0.4086 | 0.0632 |
The number of predictions, sensitivity [S = TP/(TP + FN)] and positive predictive value [PPV = TP/(TP + FP)] are given for each program (Carter et al., 2001; Chen et al., 2002; Rivas et al., 2001; Saetrom et al., 2005; Wang et al., 2006).
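As a worked illustration of the two formulas above, the short sketch below computes sensitivity and PPV from raw counts. The counts are hypothetical: the source reports only the number of predictions and the two ratios, so TP, FP and FN here were chosen so that the rounded results match the Tran row (0.4086 and 0.0632) and should not be read as reported data.

```python
def sensitivity(tp, fn):
    """S = TP / (TP + FN): fraction of known ncRNAs that were recovered."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """PPV = TP / (TP + FP): fraction of predictions that hit a known ncRNA."""
    return tp / (tp + fp)


# Hypothetical counts, assumed for illustration only (not given in the source).
tp, fp, fn = 38, 563, 55

print(round(sensitivity(tp, fn), 4), round(ppv(tp, fp), 4))
```

The two measures pull in different directions: a program that makes more predictions can raise its sensitivity while lowering its PPV, which is why the table lists the number of predictions alongside both.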
Fig. 4. Analysis of predicted ncRNA candidates 9, 11 and 12. For the Northern analysis, 30 μg of total RNA was loaded in each lane and transcript sizes were estimated using a New England Biolabs low range ssRNA ladder. (A) Analysis of candidate 9. RNA was isolated from a culture of MG1693 (rne+) at various times throughout exponential and stationary phase and separated on a 6% PAGE as described in the ‘Methods’ section. (B) Analysis of candidate 11. Total RNA from exponentially growing MG1693 (rne+) and SK3564 (Δrne) was separated on an 8% PAGE. (C) Analysis of candidate 12. Total RNA from exponentially growing MG1693 (rne+) and SK3564 (Δrne) was separated on a 6% PAGE. (D) Candidate 11 falls within the 5′ UTR of the mreB gene. RNAstar secondary structure prediction of a portion of the mreB leader (nucleotides from −269 to −51). Nucleotides shown in red at positions −269 and −106 correspond to the primer extension products detected by Wachi et al. (2006). Position −106 was originally identified as a potential transcription start site but may in fact represent an RNase E cleavage site, since it occurs in a single-stranded A/U rich region and there is no apparent σ70 promoter upstream of this site. Furthermore, we hypothesize that the distal stem-loop that ends at −51 represents a rho-independent transcription terminator that is functional in the Δrne strain.