Martin A Smith, Stefan E Seemann, Xiu Cheng Quek, John S Mattick.
Abstract
The diversity of processed transcripts in eukaryotic genomes poses a challenge for the classification of their biological functions. Sparse sequence conservation in non-coding sequences and the unreliable nature of RNA structure predictions further exacerbate this conundrum. Here, we describe a computational method, DotAligner, for the unsupervised discovery and classification of homologous RNA structure motifs from a set of sequences of interest. Our approach outperforms comparable algorithms at clustering known RNA structure families, both in speed and accuracy. It identifies clusters of known and novel structure motifs from ENCODE immunoprecipitation data for 44 RNA-binding proteins.
Keywords: Functional genome annotation; Functions of RNA structures; Machine learning; RNA structure clustering; RNA–protein interactions; Regulation by non-coding RNAs
Year: 2017 PMID: 29284541 PMCID: PMC5747123 DOI: 10.1186/s13059-017-1371-3
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Schematic of a pairwise alignment with DotAligner. A dynamic programming matrix is first filled in based on the similarity in sequence and cumulative base-wise pairing probabilities of two RNA sequences (top left: colour intensity indicates cumulative similarity score). A partition function over all pairwise alignments is calculated and interrogated for structural compatibility by stochastic backtracking (not illustrated). Two ensembles over all secondary structures are considered for this purpose (bottom left and top right dot plots: blue lines indicate cumulative base-wise pairing probabilities). The final scoring uses the base pair probabilities in the dot plots. This effectively warps the optimal sequence alignment path (top left, black outline) towards one that includes structural features (top left, blue outline and fill). In the bottom right, the optimal sequence alignment and associated consensus secondary structure are contrasted with those produced by DotAligner, exposing the common structural features hidden in the suboptimal base pairing ensemble of both sequences.
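The first step in the caption, filling a dynamic programming matrix from sequence similarity and cumulative base-wise pairing probabilities, can be illustrated with a minimal sketch. This is not DotAligner's actual scoring scheme: it uses a plain Needleman–Wunsch recursion and omits the partition function and stochastic backtracking. The function name, the weight `kappa`, the linear `gap` penalty and the form of the structural term are all assumptions made for illustration.

```python
def align_score(seq_a, seq_b, pair_a, pair_b, kappa=0.5, gap=-0.5):
    """Global alignment score mixing sequence identity with similarity in
    per-base cumulative pairing probability (values assumed to lie in [0, 1]).

    kappa weights sequence against structure; kappa and gap are
    hypothetical parameters, not values taken from the paper.
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            seq_sim = 1.0 if seq_a[i - 1] == seq_b[j - 1] else 0.0
            # Bases with similar pairing propensity score close to 1.
            struct_sim = 1.0 - abs(pair_a[i - 1] - pair_b[j - 1])
            sim = kappa * seq_sim + (1.0 - kappa) * struct_sim
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim,  # align the two bases
                dp[i - 1][j] + gap,      # gap in seq_b
                dp[i][j - 1] + gap,      # gap in seq_a
            )
    return dp[n][m]
```

With `kappa=1.0` the sketch reduces to a sequence-only alignment. Lowering `kappa` lets the pairing probabilities pull the alignment towards structurally compatible positions, which is the intuition behind the warped path in the figure.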
Fig. 2 Comparison of RNA structure alignment quality as a function of sequence identity. BRAliBase 2.1 reference RNA structure alignments were submitted to five different pairwise alignment algorithms, including the Needleman–Wunsch sequence-only alignment algorithm. Top: The total number of surveyed alignments as a function of pairwise sequence identity. The Matthews correlation coefficient (MCC), the difference in the structural conservation index (Δ-SCI) and the RNAdistance-calculated topological edit distance between the RNAalifold consensus of the computed alignment and the reference BRAliBase 2.1 alignment consensus are compared in the lower three plots. MCC Matthews correlation coefficient, SCI structural conservation index
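As an illustration of the MCC metric named in the caption, the sketch below compares the base pairs of two dot-bracket structures of equal length. It rests on assumptions not stated in the source: true negatives are counted over all n(n-1)/2 possible position pairs, pseudoknots are ignored, and the function names are hypothetical.

```python
import math


def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def structure_mcc(predicted, reference):
    """Matthews correlation coefficient between two structures.

    Assumption: every unordered pair of positions is a candidate base pair,
    so TN = n(n-1)/2 - TP - FP - FN.
    """
    n = len(reference)
    pred, ref = base_pairs(predicted), base_pairs(reference)
    tp = len(pred & ref)
    fp = len(pred - ref)
    fn = len(ref - pred)
    tn = n * (n - 1) // 2 - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (e.g. no predicted pairs).
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Identical structures give a coefficient of 1, and the value falls as predicted pairs diverge from the reference.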
Fig. 3 Classification of known RNA structures. a Receiver operating characteristic (ROC) curves measuring the classification accuracy of the surveyed algorithms by contrasting their computed similarity matrices to a binary classification matrix of Rfam sequences (1 if the sequences are in the same family or 0 if different). High PID = 56–95 % pairwise sequence identity from the provided Rfam alignment; low PID = 1–55 %. b Precision vs recall curve. c Area under the curve (AUC) of ROC values with 95 % confidence intervals for the top four performing algorithms across five ranges of pairwise sequence identity, as calculated from a variant of the Needleman–Wunsch algorithm with free end gaps. The three replicates correspond to stochastically sampled sequences from Rfam 12.3 (see Additional file 2: Table S1). d Runtime distribution of single-thread computation on a 2.6 GHz AMD Opteron processor (note, a fixed upper limit of 120 s was imposed for CARNA). AUC area under the curve
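The evaluation in panel a, contrasting a similarity matrix with a binary same-family matrix, can be sketched as follows. The example computes the ROC AUC through its rank interpretation: the probability that a same-family pair scores higher than a different-family pair. It is an illustrative reconstruction, not the authors' code, and the function name and input layout are assumptions.

```python
def roc_auc(similarity, families):
    """ROC AUC of pairwise similarity scores against same-family labels.

    similarity: square matrix (list of lists) of pairwise scores.
    families:   family label of each sequence; a pair counts as positive (1)
                when both labels match and negative (0) otherwise.
    """
    pos, neg = [], []
    n = len(families)
    for i in range(n):
        for j in range(i + 1, n):
            bucket = pos if families[i] == families[j] else neg
            bucket.append(similarity[i][j])
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Count positive pairs ranked above negative pairs; ties count one half.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 1 means every same-family pair outscores every cross-family pair, while 0.5 corresponds to scores that carry no family information.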
Fig. 4 Comparative clustering benchmark of Rfam sequences and their shuffled controls. Clustering performance metrics of three algorithms on 580 reference Rfam structures and their dinucleotide-shuffled controls. a Sensitivity vs false positive rate. b Qualitative cluster statistics (the horizontal dashed line indicates the real number of clusters from unique Rfam families). CM covariance model
Comparative clustering performance
| Algorithm | Number of clusters | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| DotAligner+OPTICS | 53 | 0.716 | 0.886 | 0.802 |
| GraphClust | 201 | 0.990 | 0.110 | 0.635 |
| NoFold (all CMs) | 62 | 0.866 | 0.965 | 0.916 |
| NoFold (filtered) | 45 | 0.674 | 0.976 | 0.826 |
Fig. 5 De novo homologous RNA motif identification. a,b Reachability plots of OPTICS clustering display the OPTICS-derived ordering of points (x-axis) and their distance to the nearest neighbour (y-axis). Colours represent significant clusters. a Clustering of Rfam benchmarking data indicating the distance to the nearest neighbour. b Clustering of 2650 ENCODE eCLIP peaks + 100 Rfam controls that overlap evolutionarily conserved secondary structure predictions. The dominant RNA binding protein in each cluster is displayed next to significant clusters. Those with an asterisk are portrayed below. c Multiple structure alignment generated by mLocaRNA on the sequences from a cluster containing both Rfam SNORNA72 sequences and DKC1 (a snoRNA-binding protein) eCLIP peaks. An unannotated DKC1-bound sequence is marked with an asterisk. d–f RNAalifold-predicted consensus RNA secondary structures: d Structure of the alignment displayed in (c). e Structure of a cluster of impartially detected DKC1-bound snoRNAs. f Structure of a novel UPF1-bound motif. dist. distance