Steffen Heyne, Fabrizio Costa, Dominic Rose, Rolf Backofen.
Abstract
MOTIVATION: Clustering according to sequence-structure similarity has become a generally accepted scheme for ncRNA annotation. Its application to complete genomic sequences as well as whole transcriptomes is therefore desirable, but it is hindered by extremely high computational costs.
Year: 2012 PMID: 22689765 PMCID: PMC3371856 DOI: 10.1093/bioinformatics/bts224
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. RNA secondary structure encoding and graph kernel features. Top: (A) The graph encoding preserves the nucleotide information (vertex labels) and the base pairs (edge labels), here depicted with different colors. (B) Additional vertices are inserted to induce features related to stacking base-pair quadruplets (thin light gray vertices at the center of each stacking pair). Right: example of features induced by the graph kernel NSPDK for a pair of vertices u, v at distance 3 with radius 0, 1, 2. Neighborhood graphs are enclosed in dashed ovals.
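The encoding in panel (A) can be illustrated with a minimal Python sketch. It assumes the structure is given in dot-bracket notation and that consecutive nucleotides are joined by "backbone" edges; both are assumptions made for illustration, and the function name `structure_to_graph` is hypothetical. The extra stacking vertices of panel (B) and the NSPDK features themselves are not reproduced here.

```python
def structure_to_graph(sequence, dot_bracket):
    """Encode an RNA sequence and its dot-bracket structure as a labeled graph.

    Returns (vertex_labels, edges): vertex i is labeled with nucleotide i,
    and each edge is a tuple (i, j, label) with label "backbone" or "basepair".
    The "backbone" edges are an assumption of this sketch.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")

    vertex_labels = list(sequence)
    edges = []

    # Connect consecutive nucleotides along the sequence.
    for i in range(len(sequence) - 1):
        edges.append((i, i + 1, "backbone"))

    # Match brackets with a stack to recover the base pairs.
    open_positions = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: unmatched ')'")
            j = open_positions.pop()
            edges.append((j, i, "basepair"))
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")

    return vertex_labels, edges


# Toy hairpin: three base pairs closing a three-nucleotide loop.
labels, edges = structure_to_graph("GGGAAACCC", "(((...)))")
```

For the toy hairpin this yields eight backbone edges and three base-pair edges; a graph kernel would then extract neighborhood subgraphs from such a labeled graph.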
Fig. 2. Complete clustering pipeline diagram. Phases that are executed in parallel are represented in stacked boxes. (1) Filter near-duplicates, (2) compute suboptimal structures, (3) compute sparse vector encoding, (4) compute global feature index and return top dense sets, (5) refine clusters with structural alignment procedure, (6) build covariance model with remaining high-quality instances, (7) populate each cluster with retrieved instances, (8) remove clustered instances and iterate from Step 4 and (9) merge redundant clusters.
Results for Rfam and small ncRNA benchmark set
| Iteration i | #Seq | #C | F | Rand | Phase 4 (s) | Time (s) | TimeALL (s) |
|---|---|---|---|---|---|---|---|
| Rfam benchmark | | | | | | | |
| 0 | | | | | | 8314 | 8314 |
| 1 | 271 | 5 | 0.882 | 0.888 | 458 | 14 995 | 23 309 |
| 2 | 629 | 14 | 0.834 | 0.932 | 416 | 19 962 | 43 272 |
| 3 | 1076 | 23 | 0.868 | 0.956 | 334 | 15 108 | 58 380 |
| 7 | 2181 | 58 | 0.877 | 0.985 | 154 | 11 964 | 104 940 |
| 15 | 2821 | 130 | | | 77 | 2491 | |
| Small ncRNA benchmark | | | | | | | |
| 0 | | | | | | 720 | 720 |
| 1 | 140 | 10 | 0.942 | 0.945 | 42 | 2434 | 3154 |
| 2 | 232 | 20 | 0.926 | 0.939 | 27 | 3395 | 6549 |
| 3 | 270 | 26 | 0.936 | 0.935 | 17 | 7681 | 14 230 |
| 7 | 329 | 35 | 0.890 | 0.897 | 5 | 250 | 23 186 |
| 15 | 360 | 43 | | | 1 | 92 | |
Results for each iteration i on the MERGED partition. Clustering quality is given as F measure and Rand index. The total number of clustered sequences is indicated with #Seq. The total number of clusters after merging is given by #C. Time denotes the total time for iteration i; TimeALL is the total serial time up to iteration i. All times are in seconds.
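As a reminder of what the Rand index measures, the following minimal Python sketch computes it by pair counting: the fraction of sequence pairs on which the reference classes and the predicted clusters agree (both together, or both apart). This is the textbook definition. How the study treats sequences that remain unclustered is not stated here, so the sketch assumes every sequence has a label in both partitions, and the toy labels are invented for illustration.

```python
from itertools import combinations


def rand_index(true_labels, predicted_labels):
    """Fraction of item pairs on which two partitions agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both labelings must cover the same items")
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two items to form a pair")

    agreements = 0
    for i, j in combinations(range(n), 2):
        same_in_truth = true_labels[i] == true_labels[j]
        same_in_prediction = predicted_labels[i] == predicted_labels[j]
        # A pair counts as an agreement if both partitions put the two
        # items together, or both keep them apart.
        if same_in_truth == same_in_prediction:
            agreements += 1

    return agreements / (n * (n - 1) // 2)


# Toy example (hypothetical labels): the second class is split in two.
reference = ["tRNA", "tRNA", "miRNA", "miRNA"]
clusters = [0, 0, 1, 2]
score = rand_index(reference, clusters)  # 5 of the 6 pairs agree
```

Splitting a true class lowers the score only through the pairs that are wrongly separated, which is why the Rand index can stay high even when clusters are fragmented.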
Overview
| Species | Type | Method | Input (instances) | Size (Mb) | Time | Clusters | MPI (avg) | SCI > 0.5 | Reference |
|---|---|---|---|---|---|---|---|---|---|
| | |||||||||
| Bacteria | Small ncRNAs | Misc | 363 | 0.06 | 6.8 h | 39 | 0.75 | 29 | NCBI ftp |
| Human | Predicted RNA elements | E | 699 | 0.03 | 0.3 h | 37 | 0.52 | 36 | |
| Misc | Small ncRNAs | Rfam | 3900 | 0.51 | 36 h | 130 | 0.64 | 98 | |
| | |||||||||
| Fugu | LincRNAs | RNA-seq | 5877 | 0.09 | 10.3 h | 99 | 0.39 | 16 | |
| Fugu | Predicted RNA elements | RNAz | 11 287 | 1.36 | 13.3 h | 97 | 0.39 | 22 | |
| Fruit fly | Predicted RNA elements | RNAz | 17 765 | 2.15 | 20.4 h | 95 | 0.34 | 23 | |
| Human | LincRNAs | RNA-seq | 31 418 | 5.40 | 3.6 d | 95 | 0.34 | 3 | |
| Human | Predicted RNA elements | E | 37 258 | 1.37 | 5.7 d | 117 | 0.75 | 109 | |
| Human | 3′UTRs | RefSeq | 118 514 | 21.91 | 12.8 d | 106 | 0.34 | 13 | |
| ∑ | | | 227 081 | 32.88 | 25.7 d | 815 | – | 349 | |
Please see text for different parameters influencing the run-times.
NCBI ftp: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
The table summarizes the datasets used in this study and gives an overview of the clusters detected by GraphClust. For each screen, we provide the number of input instances and the sum of their lengths (size). We report the serial running time required to process the input and list the number of clusters obtained. Next, we report the mean pairwise identity (MPI) and the number of clusters that have a structure conservation index (SCI) above 0.5. These are prime candidates for structured ncRNA classes.
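To make the MPI column concrete, here is a minimal Python sketch of mean pairwise identity over the rows of a cluster alignment. Gap handling differs between tools; this sketch assumes that columns where both sequences have a gap are ignored and that a gap opposite a nucleotide counts as a mismatch. That convention, the function names, and the toy alignment are illustrative assumptions, not necessarily what was used for the table.

```python
from itertools import combinations


def pairwise_identity(row_a, row_b):
    """Identity of two aligned rows, ignoring columns that are gaps in both."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # shared gap column carries no information
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


def mean_pairwise_identity(alignment):
    """Average identity over all unordered pairs of aligned rows."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical): the third row has one gap.
toy_alignment = ["GGCA", "GGCA", "GG-A"]
mpi = mean_pairwise_identity(toy_alignment)  # (1.0 + 0.75 + 0.75) / 3
```

A low MPI combined with a high SCI indicates that the structure is conserved despite sequence divergence, which is the signal the table highlights.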