Michael Bao, Miguel Cervantes Cervantes, Ling Zhong, Jason T L Wang.
Abstract
Non-coding RNA (ncRNA) genes have recently been found to serve many important functions in the cell, such as regulating gene expression at the transcriptional level. There are potentially more ncRNA molecules yet to be found, and their possible functions remain to be revealed. Discovering ncRNAs is difficult because they lack sequence indicators such as the start and stop codons displayed by protein-coding RNAs. Current methods use either sequence motifs or structural parameters to detect novel ncRNAs within genomes. Here, we present an ab initio ncRNA finder, named ncRNAscout, that uses both sequence motifs and structural parameters. Specifically, our method has three components: (i) a measure of the frequency of a sequence, (ii) a measure of the structural stability of a sequence, expressed as a t-score, and (iii) a measure of the frequency of certain patterns within a sequence that may indicate the presence of ncRNA. Experimental results show that, given a genome and a set of known ncRNAs, our method can accurately identify and locate a significant number of ncRNA sequences in the genome. The ncRNAscout tool is available for download at http://bioinformatics.njit.edu/ncRNAscout.
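To make the idea of a frequency-based score concrete, the following is a minimal sketch of a k-mer log-odds score trained on known ncRNAs against a genomic background. It is an illustration only, not the ncRNAscout implementation: the function names, the choice of k, the pseudocount, and the toy sequences are all assumptions, and the structural t-score of component (ii) is not modeled here because it would require an RNA folding program.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seqs, k):
    """Count overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_table(positives, background, k=3, pseudocount=1.0):
    """Log-odds of each ACGT k-mer in known ncRNAs versus background.

    Hypothetical scoring scheme: the pseudocount keeps unseen k-mers finite.
    """
    pos = kmer_counts(positives, k)
    bg = kmer_counts(background, k)
    n_kmers = 4 ** k
    pos_total = sum(pos.values()) + pseudocount * n_kmers
    bg_total = sum(bg.values()) + pseudocount * n_kmers
    table = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        p = (pos[kmer] + pseudocount) / pos_total
        q = (bg[kmer] + pseudocount) / bg_total
        table[kmer] = math.log(p) - math.log(q)
    return table


def score_window(seq, table, k=3):
    """Sum the log-odds over a window; k-mers with non-ACGT symbols add 0."""
    seq = seq.upper()
    return sum(table.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


# Toy data, purely illustrative.
positives = ["GGGGGGGG", "GGGCGGGG"]
background = ["ATATATATAT", "ACGTACGTAC"]
table = log_odds_table(positives, background, k=2)
print(score_window("GGGG", table, k=2) > 0)  # window resembles the positives
print(score_window("ATAT", table, k=2) < 0)  # window resembles the background
```

In a genome scan, a score of this kind would be computed over sliding windows and compared against a threshold.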
Year: 2012 PMID: 22768985 PMCID: PMC5054157 DOI: 10.1016/j.gpb.2012.05.004
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. Sensitivity, PPV, and percentage of detections that overlap with known ncRNA genes for ncRNAscout on a shuffled E. coli genome. Half of the shuffled genome was used as training data and the other half was used as test data. ncRNAscout demonstrated the best performance at a threshold of 6.0, with a PPV of 0.393, a sensitivity of 0.213, and a percentage of overlap with known ncRNA genes of 1.0.
Figure 2. Cross plot showing the distribution of positive (ncRNA) and negative (non-ncRNA) examples in the training set.
Figure 3. Correlation of log-likelihood scores between E. coli and Shigella flexneri. With a correlation coefficient of approximately 0.928 and an R² value of 0.8603 for a linear fit, the log-likelihood scores for E. coli and S. flexneri are linearly related. This relationship demonstrates that the log-likelihood score algorithms of ncRNAscout and smyRNA produce similar outcomes.
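The two statistics quoted for Figure 3 are linked: for a simple least-squares line, R² equals the square of the Pearson correlation coefficient, and 0.928² ≈ 0.861 agrees with the reported 0.8603 up to rounding. The sketch below checks that identity on made-up score pairs; the data and function names are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def linear_fit_r2(xs, ys):
    """R^2 of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical log-likelihood scores for the same k-mers in two genomes.
scores_a = [-1.2, -0.4, 0.1, 0.8, 1.5, 2.3]
scores_b = [-1.0, -0.7, 0.3, 0.6, 1.9, 2.0]

r = pearson_r(scores_a, scores_b)
r2 = linear_fit_r2(scores_a, scores_b)
print(abs(r * r - r2) < 1e-9)  # the identity holds for a simple linear fit
```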
ncRNAs detected in four genomes by ncRNAscout and smyRNA, respectively
| Genome source | Nucleotide length (nt) | GC content (%) | No. of known ncRNAs (No. of results detected by ncRNAscout, smyRNA) | Percentage of known ncRNAs detected (%) | No. of detections with MCC > 0.5 |
|---|---|---|---|---|---|
| | 1,496,992 | 31.92 | 42 ( | | |
| | 4,448,856 | 66.17 | 134 ( | | |
| | 2,107,794 | 57.21 | 49 ( | | |
| | 2,542,943 | 54.51 | 23 ( | | |
Note: In columns 4, 5, and 6, results from ncRNAscout are in bold and results from smyRNA are in italics.
Different types of known ncRNAs detected by ncRNAscout and smyRNA in the Acidovorax genome
| ncRNA type | Total No. in each type | No. detected by ncRNAscout | No. detected by smyRNA |
|---|---|---|---|
| SSU_rRNA_bacteria | 3 | 3 | 3 |
| PtaRNA1 | 1 | 0 | 1 |
| Bacteria_small_SRP | 1 | 0 | 1 |
| CRISPR_DR4 | 75 | 75 | 59 |
| tRNA | 46 | 43 | 46 |
| PK-G12rRNA | 3 | 3 | 3 |
| RNaseP_bact_a | 1 | 1 | 0 |
| tmRNA | 1 | 1 | 1 |
| 5S_rRNA | 3 | 3 | 3 |
Results from ncRNAscout and smyRNA on the shuffled Acidovorax genome
| Method | No. of detections | Detections overlapping with ncRNAs (%) | FDR (%) | PPV | Sensitivity | MCC |
|---|---|---|---|---|---|---|
| ncRNAscout | 56 | 14.3 | 88.794 | 0.112 | 0.9696 | 0.3296 |
| smyRNA | 2894 | 0.584 | 99.883 | 0.00117 | 0.239 | 0.0168 |
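The columns in the table above are standard confusion-matrix statistics, and FDR and PPV are complementary (for ncRNAscout, 1 − 0.88794 ≈ 0.112). A minimal sketch of how such values can be derived from counts follows. The counts are hypothetical, and whether the paper tallies them per nucleotide or per detection is not stated in this record, so the unit of counting is an assumption.

```python
import math


def detection_metrics(tp, fp, fn, tn):
    """PPV, FDR, sensitivity, and MCC from confusion-matrix counts."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = 1.0 - ppv if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ppv": ppv, "fdr": fdr, "sensitivity": sensitivity, "mcc": mcc}


# Hypothetical counts, not taken from the paper.
m = detection_metrics(tp=8, fp=2, fn=2, tn=88)
print(m)
```

MCC is a useful summary here because it uses all four counts, so a method that reports thousands of detections with few true hits is penalized even when its sensitivity is not negligible.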