Milad Miladi, Alexander Junge, Fabrizio Costa, Stefan E Seemann, Jakob Hull Havgaard, Jan Gorodkin, Rolf Backofen.
Abstract
MOTIVATION: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs do not take into account the compensatory base pair changes obtained from structure conservation in orthologous sequences.
Year: 2017 PMID: 28334186 PMCID: PMC5870858 DOI: 10.1093/bioinformatics/btx114
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Hypothetical example to illustrate the difference between single sequence clustering and clustering using conserved structure. Assume the indicated G-U base pair between the first and last nucleotide in the left-most blue human sequence is part of its correct secondary structure. While single sequence structure prediction (a) fails to predict the G-U pair, information about covariation contained in the alignment (b) yields the correct secondary structure for the human sequence and allows covarying base pairs to be emphasized. Taking covariation and conserved structures into account may thus yield an improved clustering.
Fig. 2. Representing the constrained folded secondary structure as a graph and feature extraction. (a) Base pairs with a reliability greater than t, derived from the alignment consensus structure, are set as structure constraints (blue boxes). A constrained secondary structure prediction is performed for the human sequence, the organism of interest in this example. The plain and, if enough covarying base pairs are found, the decorated secondary structures are represented as graphs. (b) Auxiliary vertices (gray) are added to the secondary structure graph to emphasize stacked base pairs. The secondary structure is decomposed into substructures using a graph kernel. Here, only selected neighborhood subgraphs are shown, with d = 0, which results in the extraction of single root vertices instead of root vertex pairs. The hashing function H encodes each subgraph as an integer, which in turn becomes the index of the subgraph in the sparse feature vector counting subgraph occurrences. One feature is counted twice because two of the extracted subgraphs map to the same index, while the other neighborhood subgraphs are unique. The feature extraction for N-N decorated structures is implemented the same way. (Color version of this figure is available at Bioinformatics online.)
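The feature extraction described in the Fig. 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the graph kernel used by RNAscClust: the subgraph encoding, the SHA-256 based stand-in for the hashing function H, the 32-bit index space, the toy hairpin graph, and all function names are hypothetical, and edge labels and auxiliary stacking vertices are ignored. It covers only the d = 0 case, where each feature is the neighborhood of a single root vertex.

```python
from collections import Counter, deque
import hashlib

def neighborhood(adj, root, radius):
    """Return {vertex: distance} for all vertices within `radius` of `root` (BFS)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue  # do not expand beyond the requested radius
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def encode(adj, labels, root, radius):
    """Simplified, order-independent string encoding of a neighborhood subgraph."""
    dist = neighborhood(adj, root, radius)
    verts = sorted((d, labels[v]) for v, d in dist.items())
    edges = sorted(
        tuple(sorted(((dist[u], labels[u]), (dist[w], labels[w]))))
        for u in dist for w in adj[u] if w in dist and u < w
    )
    return repr((radius, verts, edges))

def H(code, bits=32):
    """Stand-in hashing function: map an encoding to an integer feature index."""
    digest = hashlib.sha256(code.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)

def feature_vector(adj, labels, radii=(0, 1)):
    """Sparse feature vector: feature index -> number of subgraph occurrences."""
    counts = Counter()
    for r in radii:
        for v in adj:
            counts[H(encode(adj, labels, v, r))] += 1
    return counts

# Toy hairpin (hypothetical): backbone edges i-(i+1) plus base pairs (0,5) and (1,4).
labels = {0: "G", 1: "C", 2: "A", 3: "A", 4: "G", 5: "C"}
adj = {0: [1, 5], 1: [0, 2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3, 5], 5: [0, 4]}
fv = feature_vector(adj, labels)
```

In this toy graph the two G vertices yield the same radius-0 subgraph, so they share one index and that feature has a count of two.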
Fig. 3. An overview of RNAscClust, with steps executed in parallel shown as stacks. The sequence of interest in each alignment is folded under constraints to generate a conservation-aware secondary structure (steps 1 & 2). The resulting secondary structure graph is transformed into a feature vector in step 3. Using locality-sensitive hashing (step 4), candidate clusters are extracted in the fifth step. Then a series of post-processing steps, as implemented in GraphClust, is invoked (steps 6–8): sequences of each cluster are aligned using LocARNA and only well-aligning sequences are retained. A covariance model is generated with Infernal to extend clusters with sequences matching the model. Steps 6–8 are repeated until all candidate clusters are processed.
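To illustrate how hashing can turn sparse feature vectors into candidate clusters (steps 4 and 5 in Fig. 3), the sketch below uses MinHash signatures with banding, one common locality-sensitive hashing scheme. The record does not specify which scheme or parameters RNAscClust uses, so the hash family, the values of `k` and `bands`, and all names here are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

PRIME = (1 << 61) - 1  # modulus for the random affine hash functions

def make_hashes(k, seed=0):
    """Draw k hash functions x -> (a*x + b) mod PRIME with a fixed seed."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

def signature(features, hashes):
    """MinHash signature of a non-empty set of integer feature indices."""
    return tuple(min((a * x + b) % PRIME for x in features) for a, b in hashes)

def candidate_groups(items, k=16, bands=4, seed=0):
    """Group items whose signatures agree on at least one band of rows."""
    hashes = make_hashes(k, seed)
    rows = k // bands
    buckets = defaultdict(set)
    for name, feats in items.items():
        sig = signature(feats, hashes)
        for i in range(bands):
            buckets[(i, sig[i * rows:(i + 1) * rows])].add(name)
    # Only buckets holding more than one item are candidate clusters.
    return {frozenset(b) for b in buckets.values() if len(b) > 1}

# Hypothetical feature sets: "a" and "b" are identical, "c" shares nothing with them.
items = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {100, 200, 300}}
groups = candidate_groups(items)
```

Items with similar feature sets are likely to collide in at least one band, while dissimilar items rarely do, which avoids comparing every pair of sequences directly.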
Benchmark dataset statistics: Mean of the subalignment-wise mean PSI (mean PSI), number of subalignments and Rfam families in the benchmark datasets. Only Rfam families with at least three subalignments are counted
| Dataset | Mean PSI | Subalignments | Families |
|---|---|---|---|
| | 0.78 | 118 | 28 |
| | 0.73 | 234 | 48 |
| | 0.63 | 166 | 26 |
| | 0.50 | 92 | 10 |
1-Nearest-Neighbor classification performance based on pairwise similarities computed by RNAscClust and GraphClust
Mean ± standard deviation of Recall, Precision and F1-Score for 3-fold stratified cross validation are depicted.
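The evaluation protocol named in this caption could be reproduced along the following lines. This is a sketch under assumptions: the record does not state how folds were drawn, how the per-class scores were averaged, or which standard deviation was used, so the round-robin fold assignment, macro averaging, population standard deviation, and the toy similarity matrix below are all illustrative choices.

```python
from collections import defaultdict
from statistics import mean, pstdev

def stratified_folds(labels, k=3):
    """Assign item indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def one_nn(sim, train, test, labels):
    """For each test item, predict the label of its most similar training item."""
    return [labels[max(train, key=lambda t: sim[i][t])] for i in test]

def macro_scores(true, pred):
    """Macro-averaged (recall, precision, F1) over the classes present in `true`."""
    recs, precs, f1s = [], [], []
    for c in sorted(set(true)):
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        recs.append(rec)
        precs.append(prec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return mean(recs), mean(precs), mean(f1s)

def cross_validate(sim, labels, k=3):
    """Return [(mean, std)] for recall, precision and F1 across the k folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        pred = one_nn(sim, train, test, labels)
        scores.append(macro_scores([labels[i] for i in test], pred))
    return [(mean(col), pstdev(col)) for col in zip(*scores)]

# Toy data (hypothetical): two families, similarity 1.0 within a family, 0.0 across.
families = ["A", "A", "A", "B", "B", "B"]
sim = [[1.0 if a == b else 0.0 for b in families] for a in families]
results = cross_validate(sim, families)
```

Because the classifier only needs a pairwise similarity matrix, the same routine can be run on similarities from either tool, which makes the comparison depend only on the quality of the similarities.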
Fig. 4. RNAscClust and GraphClust clustering performances, measured by the Adjusted Rand Index, depending on the mean of the alignment-wise mean pairwise sequence identity (mean PSI) of the Rfam-cliques Low, Medium and High (A) as well as Rfam-ome (B) benchmark sets.
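For readers unfamiliar with the Adjusted Rand Index used in Fig. 4, the following sketch computes it from two label lists with the standard pair-counting formula. The function name and the toy label lists are illustrative, and at least two items are assumed; the record does not say which implementation the authors used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two partitions given as label lists (n >= 2)."""
    n = len(truth)
    # Pairs of items placed together in both partitions (contingency table cells).
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one single cluster
    return (sum_ij - expected) / (max_index - expected)

# Same grouping with swapped cluster names versus a grouping that splits every family.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
split = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how clusters are named, close to 0 for random assignments, and can be negative when agreement is worse than chance.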