Thomas K F Wong, Tak-Wah Lam, Wing-Kin Sung, Siu-Ming Yiu.
Abstract
BACKGROUND: Non-coding RNAs (ncRNAs) are known to be involved in many critical biological processes, and identifying them is an important task in biological research. Infernal, a popular software package, is the most successful prediction tool and exhibits high sensitivity. Its application has mainly focused on small suspected regions. We applied Infernal at the chromosome level; the results have high sensitivity but contain many false positives. Further enhancing Infernal for chromosome-level or genome-wide studies is therefore desirable.
Year: 2010 PMID: 20927402 PMCID: PMC2946929 DOI: 10.1371/journal.pone.0012848
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Analysis of the adjacent nucleotide or base pair dependency of known ncRNAs and other candidates reported by Infernal.
| Family | All known human members in Rfam | Other candidates reported by Infernal | Single-stranded regions: all known members of all species | Single-stranded regions: other candidates | Single-stranded regions: difference | Stacking pair regions: all known members of all species | Stacking pair regions: other candidates | Stacking pair regions: difference |
| RF01382 | 0 | 8985 | 2.7 | 0.5 | 2.2 | 8.85 | 2.4 | 6.44 |
| RF00017 | 0 | 4704 | 1.02 | 0.71 | 0.32 | 2.14 | 1.45 | 0.7 |
| RF00037 | 33 | 1552 | 0.74 | 0.72 | 0.02 | 3.96 | 2.12 | 1.84 |
| RF00825 | 0 | 1415 | 1.64 | 0.24 | 1.4 | 5.14 | 1.69 | 3.45 |
| RF00711 | 3 | 1019 | 1.32 | 0.31 | 1.01 | 4.4 | 1.63 | 2.77 |
| RF00736 | 18 | 1012 | 1.23 | 0.35 | 0.88 | 4.67 | 1.43 | 3.24 |
| RF00651 | 2 | 913 | 1.37 | 0.52 | 0.85 | 5.44 | 2.21 | 3.23 |
| RF00647 | 0 | 906 | 1.38 | 0.44 | 0.95 | 5.98 | 2.16 | 3.81 |
| RF00464 | 2 | 887 | 1.13 | 0.31 | 0.82 | 5.15 | 1.92 | 3.23 |
| RF00031 | 20 | 687 | 1.28 | 0.57 | 0.71 | 2.6 | 2.06 | 0.54 |
| RF00876 | 32 | 613 | 0.81 | 0.5 | 0.31 | 5.73 | 2.15 | 3.58 |
| RF00693 | 5 | 548 | 1.14 | 0.53 | 0.61 | 4.35 | 1.67 | 2.68 |
| RF00951 | 744 | 521 | 0.44 | 0.72 | −0.28 | 1.41 | 2.06 | −0.65 |
| RF00131 | 3 | 485 | 1.71 | 0.3 | 1.41 | 6.24 | 4.04 | 2.2 |
| RF00001 | 431 | 287 | 0.66 | 0.78 | −0.11 | 1.23 | 1.83 | −0.6 |
| RF00646 | 2 | 270 | 1.52 | 0.41 | 1.11 | 6.02 | 3.15 | 2.87 |
| RF00027 | 12 | 247 | 1.16 | 0.43 | 0.74 | 5.57 | 3.16 | 2.42 |
| RF00685 | 0 | 243 | 1.06 | 0.51 | 0.55 | 4.41 | 2.94 | 1.47 |
| RF00239 | 3 | 192 | 1.83 | 0.57 | 1.26 | 4.99 | 2.36 | 2.63 |
| RF00140 | 0 | 190 | 2.31 | 0.65 | 1.66 | 3.34 | 1.46 | 1.89 |
The second and third columns show the number of known members and of other candidates reported by Infernal for each family. Columns 4-6 (or 7-9) compare the dependence of adjacent nucleotides along single-stranded regions (or of adjacent base pairs along stacking pair regions) between all known ncRNA members (i.e. full members) in Rfam and the other candidates reported by Infernal for each family. A larger value indicates a higher level of dependence between adjacent single-stranded columns (or paired columns) within the multiple sequence alignment. The table lists the 20 families with the highest number of candidates reported by Infernal. In 18 of these 20 families (the exceptions are RF00951 and RF00001), the level of adjacent dependence along both single-stranded regions and stacking pair regions is higher in known ncRNAs than in the other candidates reported by Infernal. Relative to the values of the other candidates, the dependence level of known members is higher by 164% along single-stranded regions and by 109% along stacking pair regions. This supports our conjecture that adjacent dependence in human ncRNA molecules should be useful for distinguishing real ncRNAs from false positives.
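The aggregate percentages quoted in this note can be approximately recovered from the table. The following is a minimal Python sketch that assumes the relative difference is the summed difference between known members and other candidates, divided by the summed value for other candidates, over the 20 listed families. The source does not state the exact aggregation, so this formula is an assumption, and because the table values are rounded the result only approximates the reported 164% and 109%.

```python
# Dependence values copied from the table above, one tuple per family:
# (known single-stranded, other single-stranded, known stacking, other stacking)
ROWS = [
    (2.70, 0.50, 8.85, 2.40),  # RF01382
    (1.02, 0.71, 2.14, 1.45),  # RF00017
    (0.74, 0.72, 3.96, 2.12),  # RF00037
    (1.64, 0.24, 5.14, 1.69),  # RF00825
    (1.32, 0.31, 4.40, 1.63),  # RF00711
    (1.23, 0.35, 4.67, 1.43),  # RF00736
    (1.37, 0.52, 5.44, 2.21),  # RF00651
    (1.38, 0.44, 5.98, 2.16),  # RF00647
    (1.13, 0.31, 5.15, 1.92),  # RF00464
    (1.28, 0.57, 2.60, 2.06),  # RF00031
    (0.81, 0.50, 5.73, 2.15),  # RF00876
    (1.14, 0.53, 4.35, 1.67),  # RF00693
    (0.44, 0.72, 1.41, 2.06),  # RF00951
    (1.71, 0.30, 6.24, 4.04),  # RF00131
    (0.66, 0.78, 1.23, 1.83),  # RF00001
    (1.52, 0.41, 6.02, 3.15),  # RF00646
    (1.16, 0.43, 5.57, 3.16),  # RF00027
    (1.06, 0.51, 4.41, 2.94),  # RF00685
    (1.83, 0.57, 4.99, 2.36),  # RF00239
    (2.31, 0.65, 3.34, 1.46),  # RF00140
]


def relative_difference(known, other):
    """Summed (known - other) as a fraction of summed other (assumed aggregation)."""
    return (sum(known) - sum(other)) / sum(other)


single = relative_difference([r[0] for r in ROWS], [r[1] for r in ROWS])
stacking = relative_difference([r[2] for r in ROWS], [r[3] for r in ROWS])
print(f"single-stranded: {single:.0%}, stacking pairs: {stacking:.0%}")
```

Under this assumed aggregation both ratios land within a percentage point or so of the quoted figures, which suggests the reported percentages summarize the 20 listed families as a whole.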
Detailed filtering power of order-1 SCFG model.
| Family | False positives | Infernal (Opt) | Order-1 (Opt) | Infernal (Opt*0.9) | Order-1 (Opt*0.9) | Infernal (Opt*0.8) | Order-1 (Opt*0.8) |
| RF01382 | 8985 | 3633 | 1667 | 8985 | 2314 | 8985 | 3211 |
| RF00017 | 4704 | 4704 | 1124 | 4704 | 1359 | 4704 | 1652 |
| RF00037 | 1552 | 1435 | 559 | 1552 | 682 | 1552 | 788 |
| RF00825 | 1415 | 0 | 0 | 0 | 0 | 3 | 0 |
| RF00711 | 1019 | 68 | 12 | 139 | 23 | 308 | 35 |
| RF00736 | 1012 | 696 | 288 | 1012 | 343 | 1012 | 412 |
| RF00651 | 913 | 4 | 1 | 15 | 3 | 54 | 9 |
| RF00647 | 906 | 23 | 0 | 64 | 0 | 170 | 0 |
| RF00464 | 887 | 391 | 83 | 750 | 93 | 887 | 109 |
| RF00031 | 687 | 687 | 256 | 687 | 302 | 687 | 357 |
| RF00876 | 613 | 613 | 53 | 613 | 66 | 613 | 82 |
| RF00693 | 548 | 22 | 27 | 99 | 42 | 417 | 62 |
| RF00951 | 521 | 520 | 518 | 521 | 520 | 521 | 520 |
| RF00131 | 485 | 0 | 0 | 0 | 0 | 0 | 0 |
| RF00001 | 287 | 287 | 280 | 287 | 278 | 287 | 274 |
| RF00646 | 270 | 2 | 0 | 7 | 0 | 38 | 0 |
| RF00027 | 247 | 11 | 1 | 38 | 2 | 154 | 2 |
| RF00685 | 243 | 3 | 0 | 4 | 0 | 10 | 0 |
| RF00239 | 192 | 19 | 32 | 53 | 32 | 152 | 36 |
| RF00140 | 190 | 0 | 4 | 0 | 6 | 14 | 6 |
For each family, consider the lowest score that Infernal (or the order-1 SCFG model) assigns to any known member (of all species) in the Rfam database. Setting the threshold to this score gives the full power to eliminate false positives without omitting any existing real member. We define this score as the optimal threshold (Opt) for Infernal (or for the order-1 model). To also admit possible novel members whose scores fall below Opt, we select two further thresholds, Opt*0.9 and Opt*0.8, and use them to compare the robustness of the filtering power of Infernal and of our model. This table shows the detailed filtering power of the order-1 SCFG model for the 20 families in which Infernal originally reports the largest number of false positives. As the table shows, the new order-1 SCFG model filters more than 50% of the false positives reported by Infernal at all three thresholds. For the other families, which are not listed, the order-1 model also filters over 50% of the false positives.
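The thresholding scheme described in this note can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function name `surviving_candidates` and all scores are hypothetical, and the sketch assumes that a candidate is kept when its score is at least the threshold and that Opt is positive, so that scaling by 0.9 or 0.8 lowers the threshold.

```python
def surviving_candidates(candidate_scores, known_scores, factor=1.0):
    """Return candidates scoring at least factor * Opt.

    Opt is the lowest score among known family members, so every known
    member passes when factor == 1.0 (assumed '>=' comparison).
    """
    opt = min(known_scores)
    threshold = opt * factor
    return [s for s in candidate_scores if s >= threshold]


# Hypothetical scores for one family (not taken from the paper).
known = [42.0, 55.3, 61.8]
candidates = [20.0, 35.0, 39.0, 45.0, 60.0]

kept_by_factor = {
    factor: surviving_candidates(candidates, known, factor)
    for factor in (1.0, 0.9, 0.8)
}
for factor, kept in kept_by_factor.items():
    print(f"Opt*{factor}: {len(kept)} of {len(candidates)} candidates kept")
```

Lowering the factor admits more candidates, which mirrors how the counts in the table grow from the Opt columns to the Opt*0.8 columns.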
Using RNAz to further verify whether any of the false positives filtered by the order-1 method, and any of those not filtered by it, are ncRNAs.
| Threshold | False positives | Filtered by order-1: estimated as RNA | Filtered by order-1: total | Filtered by order-1: % | Not filtered by order-1: estimated as RNA | Not filtered by order-1: total | Not filtered by order-1: % |
| Opt | 13578 | 221 | 8201 | 2.7% | 1795 | 5377 | 33.4% |
| Opt * 0.9 | 20661 | 383 | 13708 | 2.8% | 1633 | 6953 | 23.5% |
| Opt * 0.8 | 22431 | 417 | 13902 | 3.0% | 1586 | 8529 | 18.6% |
We use another popular software package, RNAz [8], to further evaluate the candidates filtered by our order-1 method. Only about 3% (2.7% to 3.0%, depending on the threshold) are estimated to be ncRNAs by RNAz. We perform the same test on the candidates kept by our order-1 method; between 18.6% and 33.4% of them are estimated to be ncRNAs by RNAz. Although RNAz may not give an accurate result, it provides some evidence that most of the filtered candidates are false positives. On the other hand, the candidates that are kept by our order-1 method but cannot be confirmed by RNAz should be evaluated further.
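The percentages in the RNAz table can be rechecked directly from its counts. The short Python sketch below recomputes them and also checks that the filtered and unfiltered candidates add up to the false-positive total at each threshold; the names `RNAZ` and `rnaz_summary` are illustrative only.

```python
# Counts copied from the RNAz table above, per threshold:
# (false positives, filtered & RNA, filtered total, kept & RNA, kept total)
RNAZ = {
    "Opt": (13578, 221, 8201, 1795, 5377),
    "Opt*0.9": (20661, 383, 13708, 1633, 6953),
    "Opt*0.8": (22431, 417, 13902, 1586, 8529),
}


def rnaz_summary(counts):
    """Return (totals consistent?, % RNA among filtered, % RNA among kept)."""
    total_fp, filt_rna, filt_total, kept_rna, kept_total = counts
    # Filtered and kept candidates should partition the false positives.
    consistent = filt_total + kept_total == total_fp
    filt_pct = round(100 * filt_rna / filt_total, 1)
    kept_pct = round(100 * kept_rna / kept_total, 1)
    return consistent, filt_pct, kept_pct


SUMMARY = {name: rnaz_summary(counts) for name, counts in RNAZ.items()}
for name, (consistent, filt_pct, kept_pct) in SUMMARY.items():
    print(f"{name}: consistent={consistent}, filtered {filt_pct}%, kept {kept_pct}%")
```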
Result of Infernal and our order-1 SCFG model on simulated data.
| Family | False positives | Infernal (Opt) | Order-1 (Opt) | Infernal (Opt*0.9) | Order-1 (Opt*0.9) | Infernal (Opt*0.8) | Order-1 (Opt*0.8) |
| RF01382 | 14174 | 7115 | 2915 | 14174 | 4022 | 14174 | 5262 |
| RF00037 | 797 | 734 | 315 | 797 | 371 | 797 | 430 |
| RF00031 | 435 | 435 | 114 | 435 | 136 | 435 | 168 |
| RF00559 | 316 | 25 | 0 | 112 | 0 | 316 | 0 |
| RF00736 | 243 | 169 | 97 | 243 | 116 | 243 | 134 |
| RF00693 | 179 | 8 | 9 | 26 | 18 | 139 | 24 |
| RF00876 | 150 | 150 | 2 | 150 | 4 | 150 | 7 |
| RF00825 | 86 | 0 | 0 | 0 | 0 | 0 | 0 |
| RF00711 | 82 | 3 | 6 | 6 | 7 | 12 | 11 |
| RF00239 | 73 | 4 | 15 | 10 | 16 | 55 | 16 |
| RF00661 | 72 | 60 | 54 | 72 | 54 | 72 | 55 |
| RF00379 | 59 | 0 | 0 | 2 | 0 | 16 | 0 |
| RF00464 | 52 | 4 | 4 | 34 | 5 | 52 | 6 |
| RF00001 | 44 | 43 | 37 | 44 | 34 | 44 | 30 |
| RF00651 | 42 | 0 | 0 | 0 | 0 | 2 | 0 |
| RF00188 | 39 | 1 | 0 | 4 | 1 | 24 | 2 |
| RF00614 | 37 | 5 | 0 | 16 | 0 | 37 | 0 |
| RF00131 | 36 | 0 | 0 | 0 | 0 | 0 | 0 |
| RF00519 | 35 | 14 | 0 | 35 | 0 | 35 | 0 |
| RF00468 | 30 | 0 | 0 | 0 | 0 | 4 | 0 |
Figure 1. Comparison of the score distribution between false positives and all the known members (in Rfam) of the family RF00017 based on the SCFG and order-1 SCFG models.
The set of production rules of Order-1 SCFG.
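| State type | Description | Production rule | Score |
| [not recovered] | pair emitting | [not recovered] | [not recovered] |
| [not recovered] | left emitting | [not recovered] | [not recovered] |
| [not recovered] | right emitting | [not recovered] | [not recovered] |
| [not recovered] | bifurcation | [not recovered] | 0 |
| [not recovered] | delete | [not recovered] | [not recovered] |
| [not recovered] | start | [not recovered] | [not recovered] |
| [not recovered] | end | [not recovered] | 0 |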
Note that a P-type state indicates the pairwise relationship between two symbols, so the grammar can produce not only a sequence of symbols but also the corresponding structure of the sequence.
Example of applying production rules to generate a sequence with corresponding structure.
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Sequence | [symbols not recovered] |
| Structure | [symbols not recovered] |
To generate the above sequence with its corresponding structure, the steps are:
Note that the single dots, the bars and the double dots above the characters mark the base pairs emitted by the respective pair-emitting states.
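How production rules yield both a sequence and its structure can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not the example from the table above: the one-letter state labels and the rule shapes (pair emitting adds one symbol on each side, left emitting adds one on the left, right emitting adds one on the right, end stops) follow the usual covariance-model convention rather than the paper's own notation, the sequence is invented, and no scores or order-1 dependencies are modelled.

```python
from typing import List, Tuple


def derive(steps: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Apply emission steps from the outside in.

    Returns (sequence, dot-bracket structure). Assumed state labels:
    'P' pair emitting, 'L' left emitting, 'R' right emitting, 'E' end.
    """
    left_seq: List[str] = []
    left_struct: List[str] = []
    right_seq: List[str] = []
    right_struct: List[str] = []
    for state, symbols in steps:
        if state == "P":  # emit a base pair: one symbol on each side
            left_seq.append(symbols[0])
            left_struct.append("(")
            right_seq.append(symbols[1])
            right_struct.append(")")
        elif state == "L":  # emit an unpaired symbol on the left
            left_seq.append(symbols)
            left_struct.append(".")
        elif state == "R":  # emit an unpaired symbol on the right
            right_seq.append(symbols)
            right_struct.append(".")
        elif state == "E":  # end of the derivation
            break
        else:
            raise ValueError(f"unknown state type: {state}")
    # Right-side emissions were collected outside-in, so reverse them.
    sequence = "".join(left_seq) + "".join(reversed(right_seq))
    structure = "".join(left_struct) + "".join(reversed(right_struct))
    return sequence, structure


# Hypothetical 8-nucleotide hairpin with one unpaired 3' nucleotide.
STEPS = [("R", "A"), ("P", "GC"), ("P", "AU"),
         ("L", "G"), ("L", "A"), ("L", "A"), ("E", "")]
sequence, structure = derive(STEPS)
print(sequence)
print(structure)
```

Because each pair-emitting step contributes both the symbols and a matching pair of brackets, the structure falls out of the derivation itself, which is the point made in the note on P-type states.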
Figure 2. Left: Consensus structure. Right: Tree representation of the consensus structure.
Figure 3. A high-level picture of an order-1 SCFG model for the multiple sequence alignment of two sequences.
Figure 4. The details of the order-1 SCFG model in two base-pair positions.