RNA secondary structure factorization in prime tangles.

Abstract

BACKGROUND: Due to their key role in various biological processes, RNA secondary structures have always been the focus of in-depth analyses, with great efforts from mathematicians and biologists to find a suitable abstract representation for modelling their functional and structural properties. One contribution is due to Kauffman and Magarshak, who modelled RNA secondary structures as mathematical objects constructed in link theory: tangles of the Brauer Monoid. In this paper, we extend the tangle-based model with its minimal prime factorization, which is useful for analyzing the patterns that characterize the RNA secondary structure.
RESULTS: By leveraging the mapping between RNA and tangles, we prove that the prime factorizations of tangle-based models share some patterns with RNA folding's features. We analyze the E. coli tRNA and provide some visual examples of interesting patterns.
CONCLUSIONS: We formulate an open question on the nature of the class of equivalent factorizations and discuss some research directions in this regard. We also propose some practical applications of the tangle-based method to RNA classification and folding prediction as a useful tool for learning algorithms, even though the full factorization is not known.

Entities: Chemical

Keywords: Brauer monoid; RNA folding; RNA pseudoknots characterization

Substances：
RNA

Year: 2022 PMID： 35982399 PMCID： PMC9386957 DOI： 10.1186/s12859-022-04879-5

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.307

Background

RNA

In biological cells, RNA is a molecule that regulates a huge variety of functions. It consists of a long chain of smaller molecules, called nucleotides (Adenine (A), Guanine (G), Cytosine (C), and Uracil (U)), bonded sequentially; this chain is known as the primary structure. The first nucleotide of the chain is usually referred to as the 5’ end and the last one as the 3’ end. A secondary structure appears when the RNA molecule folds onto itself, creating additional weaker bonds, called Watson-Crick pairs (A-U, C-G) and wobble pairs (G-U). Figure 1 shows a primary and a secondary structure along with the dot-bracket notation, a string in which a pair of matching brackets corresponds to a weak bond in the secondary structure and dots correspond to unpaired nucleotides. The dot-bracket string can also be represented by a flattened diagram, that is, a set of points displayed horizontally (representing the nucleotides) joined by arcs in the upper half of the diagram (representing the pairs). Since every arc has to connect two dots, a flattened diagram with N arcs has 2N paired dots.
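As a minimal sketch of how the dot-bracket notation encodes the arcs of the flattened diagram, the following Python function recovers the paired positions from a dot-bracket string. The function name `parse_dot_bracket` and the set of accepted bracket types are illustrative assumptions and are not taken from the original work.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) arcs (0-based positions) of a dot-bracket string.

    Each bracket type is matched independently, so crossing arcs
    (pseudoknots) written with different bracket types are supported.
    """
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    arcs = []
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(pos)          # remember where the arc starts
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            arcs.append((stack.pop(), pos))  # close the most recent open arc
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(arcs)
```

For a string with N arcs, the result contains N pairs covering 2N distinct positions, which matches the counting argument above.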

Fig. 1

RNA structures, dot-bracket notation and flattened diagram. Example of an RNA found in Mus musculus (house mouse) [18]. Its primary structure is on the left and the secondary structure is on the right, along with its dot-bracket representation and flattened diagram. Image generated using FORNA [9]

Depending on the bonds present in the secondary structure, different types of brackets may be needed to avoid ambiguity. The folding process gives rise to some interesting structural features (loops) that can be categorized as hairpins, bulges, stems, interior loops (see Fig. 2), and multiloops (see Fig. 3).

Fig. 2

Patterns emerging from a secondary structure. Example of various patterns that can emerge from a secondary structure. Blue nucleotides are part of a hairpin, green ones are part of stems, yellow nucleotides are part of a bulge, brown ones are part of an interior loop

Fig. 3

A pseudoknotted tRNA. Secondary structure of the yeast phenylalanine tRNA along with its dot-bracket representation [1]. The folding forms a pseudoknot because of the G-C pairs at positions 18–50 and 14–42. There are three multiloops (coloured in red) at the base of the three stems with hairpins

It is often the case that RNA secondary structures form a pseudoknot, where an otherwise unpaired nucleotide in one loop bonds with a nucleotide in a different loop of the RNA molecule (Fig. 3). Predicting the optimal structure with pseudoknots during the folding process, also known as the RNA folding problem, often requires a prohibitive amount of time. Although great effort has been devoted to solving this problem, both from an algebraic [2, 19, 20, 22] and a machine learning perspective [25], there is still room for improvement.

Due to its pivotal role in biological processes, the study of RNA secondary structures is of great importance. The process of protein production is the result of the interaction of three types of RNA: transfer RNA, ribosomal RNA, and messenger RNA. Viruses have evolved to inject their genome (in the form of RNA) into the host cells in order to replicate themselves. Moreover, it is still debated whether the self-replicating capabilities of RNA may have provided the basis for early life on Earth even before DNA appeared (RNA World Hypothesis [11, 14]).

This work proposes a different way to investigate RNA folding with an algebraic structure during the process of optimization, exploiting its decomposition in prime factors.

Brauer monoid

A monoid is an algebraic structure made by a set of elements and an associative binary operator equipped with an identity element. Given a natural number N and a set of 2N dots in $[N] \cup [N']$, where $[N] = \{1, 2, \ldots, N\}$ and $[N'] = \{1', 2', \ldots, N'\}$, a tangle is a set of N pairs (called edges) of distinct dots, such that no dot occurs in more than one edge. Tangles are represented graphically by drawing two rows of N dots, labelled with $[N]$ if they are on the top and with $[N']$ if they are on the bottom. All edges are represented by lines connecting pairs of dots. The edge enumeration of a tangle is called its invariant, and we will represent it by separating edges by commas and pairs of dots by colons (see Fig. 5). We can compose two tangles by identifying the bottom row of the first with the top row of the second one and then redrawing the edges accordingly (see Fig. 4). The set of all tangles on 2N points under the composition operator $\circ$ is called the Brauer Monoid $\mathcal{B}_N$ [3].

Fig. 5

Examples of tangles. a A graphical representation of a tangle together with its invariant. b The identity tangle

Fig. 4

Examples of tangle composition. Composition of two tangles. The first tangle is put on top of the second one, then the resulting edges are redrawn to minimize intersections

Edges of the form $x : y'$ are called transversals; depending on how $x$ compares with $y$, we call them positive, negative, or zero transversals (a zero transversal has $x = y$). Edges of the form $x : y$ or $x' : y'$ are called upper and lower hooks respectively [6]. The size of an edge $a : b$, with a and b arbitrary dots, is defined as $|a - b|$. The Brauer Monoid $\mathcal{B}_N$ is closed under composition and its identity is $I_N$. A tangle P is called prime if it can only be written in the form $P \circ I_N$ or $I_N \circ P$. There are two types of prime tangles (Fig. 6):

$$T_i = \{1 : 1', \ldots, i : (i+1)', (i+1) : i', \ldots, N : N'\}$$
$$U_i = \{1 : 1', \ldots, i : (i+1), i' : (i+1)', \ldots, N : N'\}$$

called respectively $T$-prime and $U$-prime. $\mathcal{B}_N$ contains exactly $N-1$ $T$-primes and $N-1$ $U$-primes.

Fig. 6

Examples of prime tangles. Two prime tangles. a A $T$-prime and b a $U$-prime

Note that crossings in a tangle are only introduced by $T$-primes. $T$-primes and $U$-primes are the generators for all tangles in $\mathcal{B}_N$ under composition, which means that we can reduce any tangle to a prime factorization. It is useful to note here that factorization in the Brauer Monoid is not unique. A factor list for a tangle X is a list of prime tangles $P_1, P_2, \ldots, P_k$ whose composition $P_1 \circ P_2 \circ \cdots \circ P_k$ gives back X. The length of a factor list is the number of primes it contains. The factor list of the identity tangle is the empty list, whose length is 0. For each tangle $X \in \mathcal{B}_N$, we call the factorization problem the task of finding a factor list of minimal length.
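The composition of tangles and the two families of primes can be made concrete with a short sketch. The representation below (a tangle as a frozenset of two-dot edges, with `("t", k)` for the top dot k and `("b", k)` for the bottom dot k') and all function names are assumptions made for illustration; closed loops created in the middle row are simply discarded.

```python
def identity(n):
    """Identity tangle I_n: every top dot k is joined to the bottom dot k'."""
    return frozenset(frozenset({("t", k), ("b", k)}) for k in range(1, n + 1))


def prime(kind, i, n):
    """Return T_i (kind="T") or U_i (kind="U") on 2n dots, for 1 <= i < n."""
    edges = {frozenset({("t", k), ("b", k)})
             for k in range(1, n + 1) if k not in (i, i + 1)}
    if kind == "T":   # two crossing transversals
        edges |= {frozenset({("t", i), ("b", i + 1)}),
                  frozenset({("t", i + 1), ("b", i)})}
    else:             # one upper hook and one lower hook
        edges |= {frozenset({("t", i), ("t", i + 1)}),
                  frozenset({("b", i), ("b", i + 1)})}
    return frozenset(edges)


def compose(x, y):
    """Put x on top of y, glue x's bottom row to y's top row, follow the paths."""
    n = len(x)
    partner = {}
    for name, tangle in (("x", x), ("y", y)):
        for edge in tangle:
            a, b = tuple(edge)
            partner[(name, a)] = b
            partner[(name, b)] = a
    result, seen = set(), set()
    outer = [("x", ("t", k)) for k in range(1, n + 1)]
    outer += [("y", ("b", k)) for k in range(1, n + 1)]
    for layer, start in outer:
        if start in seen:
            continue
        d = start
        while True:
            d = partner[(layer, d)]
            if (layer, d[0]) in (("x", "t"), ("y", "b")):
                break  # reached the top row of x or the bottom row of y
            # otherwise we are on the shared middle row: switch tangle
            layer, d = ("y", ("t", d[1])) if layer == "x" else ("x", ("b", d[1]))
        seen.update((start, d))
        result.add(frozenset({start, d}))
    return frozenset(result)


def compose_all(factors, n):
    """Compose a factor list from left to right, starting from the identity."""
    result = identity(n)
    for p in factors:
        result = compose(result, p)
    return result
```

With this sketch, `compose_all` applied to a factor list gives back the tangle it factorizes, and the empty list gives the identity tangle.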

Methods

The first attempt to draw a connection between RNA secondary structures and tangles in the Brauer Monoid was due to Kauffman and Magarshak [12]. Their intuition was that the number of parentheses in the RNA dot-bracket representation and the number of dots in a tangle are always even, and that each open parenthesis must correspond to a closed parenthesis somewhere in the string, which corresponds to the existence of an edge in a tangle. Therefore, they provided the following procedure for converting an RNA secondary structure to a tangle:

1. flatten the secondary structure in a single long chain (equivalent to the dot-bracket notation);
2. discard the unpaired nucleotides, so that 2N nucleotides and N pairs remain;
3. abbreviate stacked arcs to a single arc. We will call this reduced diagram a shape [10, 21];
4. rotate the second half of the shape diagram above the first;
5. enumerate the nucleotides in the top row with numbers in $[N]$ and the nucleotides in the bottom row with numbers in $[N']$.

As Giegerich et al. pointed out, the study of the shape of an RNA secondary structure relieves the user of the burden of paying attention to changes that do not affect the overall desired structure, which means that we do not lose information because we are doing a static analysis [10]. In this context, the procedure described above gives us the opportunity to study the shape of RNA secondary structures in terms of tangles and generators for these tangles. For this purpose, we wrote an algorithm capable of finding the minimal number of prime compositions for any given tangle [16].

We classify a tangle X in the following way:

$T$-tangle: all edges of X are transversals;
$U$-tangle: X has a lower hook h of size 1;
$TL$-tangle: a $U$-tangle with the extra condition of having only $U$-primes as factors (no edge in X intersects another edge; $TL$ stands for Temperley-Lieb, who first described them [23]);
$H$-tangle: all the other tangles ($H$ stands for big hook, because they will always have a lower hook h of size greater than 1).

For a visual example see Fig. 7. For each class of tangles, we provide an algorithm for calculating its factorization.
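The mapping procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: arcs are given as 0-based position pairs, two arcs count as stacked when they become $(i, j)$ and $(i+1, j-1)$ once unpaired nucleotides are removed, and the fold places the first half of the shape on the bottom row (position p becomes the bottom dot p+1) and the rotated second half on the top row (position p becomes the top dot 2N − p). The labelling convention used in the original figures may differ, and the names `shape` and `fold` are hypothetical.

```python
def shape(arcs):
    """Discard unpaired positions and collapse stacked arcs into a single arc."""
    arcs = sorted(tuple(sorted(arc)) for arc in arcs)
    while True:
        # Renumber the paired positions consecutively (this drops unpaired ones).
        index = {p: k for k, p in enumerate(sorted(p for arc in arcs for p in arc))}
        arcs = sorted((index[i], index[j]) for i, j in arcs)
        present = set(arcs)
        stacked = [(i, j) for i, j in arcs if (i + 1, j - 1) in present]
        if not stacked:
            return arcs
        i, j = stacked[0]
        arcs.remove((i + 1, j - 1))  # keep the outer arc, drop the inner one


def fold(shape_arcs):
    """Fold a shape with N arcs into a tangle on 2N dots (assumed convention)."""
    n = len(shape_arcs)

    def dot(p):
        # First half -> bottom row, left to right; second half -> top row, rotated.
        return ("b", p + 1) if p < n else ("t", 2 * n - p)

    return {frozenset({dot(i), dot(j)}) for i, j in shape_arcs}
```

Under these assumptions, two crossing arcs fold into a pair of crossing transversals, while two side-by-side hairpins fold into one upper and one lower hook.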

Fig. 7

Types of tangles. A display of our tangle classification criteria. a A $T$-tangle, b a $U$-tangle, c a $TL$-tangle, d an $H$-tangle

Factoring $T$-tangles

The set of $T$-tangles on 2N dots is isomorphic to the symmetric group $S_N$, therefore we can represent any $T$-tangle X as a permutation in the two-row form

$$\begin{pmatrix} 1 & 2 & \cdots & N \\ \sigma(1) & \sigma(2) & \cdots & \sigma(N) \end{pmatrix}$$

and we can find an optimal factorization by sorting the bottom row of X. Since every $T$-prime is equivalent to an adjacent swap, we are limited to $O(N^2)$ algorithms, like BubbleSort.

Factoring $TL$-tangles

Ernst et al. defined a factorization algorithm that constructs a minimal factor list given an input $TL$-tangle [7]. Their algorithm works by subdividing the tangle into vertical columns and then enumerating all regions of odd depth (called 1-regions) that this subdivision generates. Each region corresponds to a $U$-prime, and if two regions $r_1$ and $r_2$ are diagonally adjacent, with $r_1$ having a lower depth than $r_2$, then they write $r_1 \to r_2$, thereby constructing a Directed Acyclic Graph (DAG) of regions. By reading this graph from left to right and from top to bottom, they obtain a minimal factor list. Our implementation of their algorithm takes quadratic time. For a more detailed explanation, the reader can refer to the original paper.

Factoring $U$-tangles

Recall that a $U$-tangle is a tangle of the form $X = X' \circ U_i$; we would like to find $X'$ by removing $U_i$ from X. To do this, we merge the lower hook with another edge in the tangle. We say that we merge a lower hook h and an edge e by removing them from X and adding two edges a and b, each joining one endpoint of e to one endpoint of h; which endpoints are joined depends on whether e is a hook or a negative transversal, or a positive transversal. Since the number of crossings in a tangle corresponds to the number of $T$-primes in its factor list, we would like this merging process to keep the crossing number constant. In this way we are sure not to include any additional $T$-primes in the non-optimal factor list we are calculating.
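The BubbleSort factorization of $T$-tangles described above can be sketched as follows. The encoding is an assumption made for illustration: a $T$-tangle is given by its bottom row, so that `bottom[k-1] = j` stands for the edge $k : j'$, and the function names are hypothetical. Every adjacent swap performed by the sort is recorded as the index i of a $T$-prime $T_i$.

```python
def factor_t_tangle(bottom):
    """Return the indices i of the T-primes T_i whose composition is the tangle."""
    row = list(bottom)
    factors = []
    for end in range(len(row) - 1, 0, -1):
        for k in range(end):
            if row[k] > row[k + 1]:
                row[k], row[k + 1] = row[k + 1], row[k]
                factors.append(k + 1)  # swapping positions k+1 and k+2 is T_{k+1}
    return factors


def compose_t_primes(factors, n):
    """Recompose T_{f1} o T_{f2} o ... starting from the identity bottom row."""
    row = list(range(1, n + 1))
    for i in factors:
        # Composing on the right with T_i exchanges the bottom labels i and i+1.
        row = [i + 1 if v == i else i if v == i + 1 else v for v in row]
    return row
```

Because BubbleSort performs exactly one swap per inversion of the bottom row, the number of factors returned equals the number of crossings of the $T$-tangle, which is why the resulting factor list has minimal length.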

Heuristic 1

Let X be a $U$-tangle with c crossings and a lower hook $h = i' : (i+1)'$. Let $I = \{i : i', (i+1) : (i+1)'\}$ be the pair of imaginary vertical edges above the endpoints of h. For all edges $e \in X$ calculate inter(e) to be the number of intersections e has with edges in I. Consider the set of edges that intersect both edges in I; for each such edge e, calculate the number of crossings the tangle would have if we merged h with e, and pick the tangle whose number of crossings is equal to c. If more than one edge satisfies this last condition, pick among them the edge that has the fewest intersections in X. Note that some edges in X may share a dot with edges in I; we count them as intersecting too. Merging two edges takes constant time, so the cost of the heuristic is dominated by the calculation of the crossing number [24], which has to be repeated for up to N candidate merges in the worst case.

Factoring $H$-tangles

We will extract factors from an $H$-tangle X by transforming it into a $U$-tangle. The idea is to take one of the lower hooks h with size greater than 1 and shrink it until it has size one. To do this we compose X with $T$-primes until this condition is met. During the shrinkage process, other edges will inevitably change size. In order to decide where we should shrink h, we use a heuristic that chooses a location where the size of the other edges increases the least. We apply this heuristic to the smallest lower hook of X, so that there is no smaller lower hook inside of it.
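Heuristic 1 relies on the crossing number of a tangle. A simple quadratic way to compute it, given here only as a sketch and not as the method of [24], is to place the dots on a circle (top row from left to right, then bottom row from right to left) and to count the pairs of edges whose endpoints interleave. The dot encoding `("t", k)` / `("b", k)` and the function name are assumptions made for illustration.

```python
from itertools import combinations


def crossing_number(tangle, n):
    """Count the pairs of edges of a tangle on 2n dots that must cross."""
    def position(dot):
        row, k = dot
        # Top dots occupy 0..n-1 clockwise, bottom dots continue from right to left.
        return k - 1 if row == "t" else 2 * n - k

    chords = [tuple(sorted(position(d) for d in edge)) for edge in tangle]
    return sum(1 for (a, b), (c, d) in combinations(chords, 2)
               if a < c < b < d or c < a < d < b)
```

For a $T$-tangle this count coincides with the number of inversions of its bottom row, that is, with the number of $T$-primes in a minimal factor list.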

Heuristic 2

Given an $H$-tangle X, let h be the smallest lower hook of X of size greater than 1. Let j be the index of the shrinkage location where the size of the other edges increases the least. Shrink the lower hook h into location j by composing X with the $T$-primes that move the endpoints of h to that location. This procedure yields a $U$-tangle $X'$; since each $T$-prime is its own inverse (rule R1 in Table 1), X is recovered by composing $X'$ with the same $T$-primes in reverse order, where the reverse of a factor list $[P_1, P_2, \ldots, P_k]$ is $[P_k, \ldots, P_2, P_1]$. This heuristic is not optimal, but it can be computed in linear time.

Minimal factorization

The heuristics mentioned above do not always yield a minimal factorization, therefore a minimization step is required. It turns out that prime tangles follow a particular set of rules (see Table 1) [13]. We call R1–10 delete rules and R11–13 move rules. We can use them to minimize non-optimal factor lists by implementing them in a rewriting logic tool (we chose the Maude System [5, 15]).

Table 1

Rules for prime tangles

Rule type	Rule id	Rule	Condition
Delete	R1	$T_i \circ T_i = I_N$
	R2	$U_i \circ U_i = U_i$
	R3	$T_i \circ U_i = U_i$
	R4	$U_i \circ T_i = U_i$
	R5	$U_i \circ U_j \circ U_i = U_i$	$|i-j| = 1$
	R6	$U_i \circ T_j \circ U_i = U_i$	$|i-j| = 1$
	R7	$T_i \circ U_j \circ U_i = T_j \circ U_i$	$|i-j| = 1$
	R8	$U_i \circ U_j \circ T_i = U_i \circ T_j$	$|i-j| = 1$
	R9	$U_i \circ T_j \circ T_i = U_i \circ U_j$	$|i-j| = 1$
	R10	$T_i \circ T_j \circ U_i = U_j \circ U_i$	$|i-j| = 1$
Move	R11	$T_i \circ T_j \circ T_i = T_j \circ T_i \circ T_j$	$|i-j| = 1$
	R12	$T_i \circ U_j \circ T_i = T_j \circ U_i \circ T_j$	$|i-j| = 1$
	R13	$P_i \circ P_j = P_j \circ P_i$	$|i-j| > 1$
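To illustrate how the delete rules of Table 1 shorten a factor list, the sketch below applies R1 to R10 greedily to a list of primes written as `("T", i)` or `("U", i)`. It is written in Python rather than in the Maude System used by the authors, it deliberately ignores the move rules R11 to R13, and the function names are hypothetical, so it is not guaranteed to reach a minimal factor list.

```python
# Right-hand sides of R5-R10 for a triple X_i o Y_j o Z_i with |i - j| = 1,
# keyed by the kinds (X, Y, Z); "i" and "j" are placeholders for the indices.
TRIPLE_RULES = {
    "UUU": [("U", "i")],              # R5
    "UTU": [("U", "i")],              # R6
    "TUU": [("T", "j"), ("U", "i")],  # R7
    "UUT": [("U", "i"), ("T", "j")],  # R8
    "UTT": [("U", "i"), ("U", "j")],  # R9
    "TTU": [("U", "j"), ("U", "i")],  # R10
}


def delete_step(factors):
    """Apply the first applicable delete rule once; return None if none applies."""
    for k in range(len(factors) - 1):
        (a, i), (b, j) = factors[k], factors[k + 1]
        if i == j:
            if a == b == "T":                       # R1: T_i o T_i = I_N
                return factors[:k] + factors[k + 2:]
            # R2, R3, R4: any other pair on the same index collapses to U_i
            return factors[:k] + [("U", i)] + factors[k + 2:]
        if k + 2 < len(factors) and abs(i - j) == 1:
            c, m = factors[k + 2]
            rhs = TRIPLE_RULES.get(a + b + c) if m == i else None
            if rhs is not None:
                new = [(kind, i if idx == "i" else j) for kind, idx in rhs]
                return factors[:k] + new + factors[k + 3:]
    return None


def apply_delete_rules(factors):
    """Apply delete rules until none matches; every step shortens the list."""
    factors = list(factors)
    while True:
        shorter = delete_step(factors)
        if shorter is None:
            return factors
        factors = shorter
```

A complete minimizer would also explore the rearrangements allowed by the move rules, since a move can expose a pattern that a delete rule then removes.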

From RNA to tangle factorization

We will now provide an example of the mapping procedure for deriving, from an RNA secondary structure, a tangle with its prime factors. We will start from the modified E. coli tRNA in Fig. 8 [8], and apply Kauffman and Magarshak’s mapping to obtain the flattened diagram in Fig. 9a.

Fig. 8

Modified E. coli tRNA. Pseudoknotted secondary structure for a modified E. coli tRNA along with its dot-bracket representation

Fig. 9

Flattened diagram, shape diagram, and corresponding tangle. a The flattened diagram for the modified E. coli tRNA. b The shape diagram is computed by merging together all parallel edges in the flattened diagram. c The corresponding $U$-tangle

This diagram is reduced to obtain a shape diagram (Fig. 9b) that can be folded to get the corresponding tangle (Fig. 9c). We can now factorize it by using the methods discussed previously (Fig. 10).

Fig. 10

Factorization steps. The steps (from a to d) that our algorithm takes in order to factorize the tangle (a)

Figure 10 shows the four steps of the factorization algorithm:

1. The algorithm recognizes that X is a $U$-tangle because there is a lower hook of size 1. Therefore it can be rewritten as the composition of a smaller tangle with a $U$-prime. The algorithm applies Heuristic 1, which determines that the upper hook 2 : 4 is the only edge intersecting the two imaginary edges (the two vertical dotted lines) twice. Therefore these two edges are merged to obtain the smaller tangle. The $U$-prime is yielded and the algorithm moves to the next step.
2. The rewritten tangle is a $T$-tangle. The algorithm applies BubbleSort, which first extracts a $T$-prime, thus shrinking one of the edges.
3. BubbleSort applies one more swap, which corresponds to a second $T$-prime.
4. The algorithm has now reached the identity tangle and the first part of the factorization process has terminated.

Thus the yielded factorization consists of two $T$-primes followed by one $U$-prime. Now the algorithm moves to the rewriting logic step, whose aim is to ensure that this is the minimal factorization and, if it is not, to find a better one. Since there is no move rule that can lead to the application of a delete rule, the algorithm concludes that this factor list is minimal (Fig. 11c).

Fig. 11

Example of a tRNA with its corresponding tangle and factorization. a A modified E. coli tRNA. b The corresponding abbreviated tangle, with its minimal factorization. c The three factors composed. This makes it easier to visualize the path that each edge takes

An online interactive demo that calculates these steps automatically is available [17].

Examples

RNA without pseudoknots

Figure 12a is an example of an RNA molecule that does not have any pseudoknots, therefore its corresponding tangle will not have any crossings. This implies that it will be mapped to a $TL$-tangle, which we know can be factorized using Ernst’s algorithm. To obtain the corresponding tangle we apply Kauffman and Magarshak’s mapping. We take its secondary structure (represented as a flattened diagram in Fig. 12b) and reduce it to a shape diagram (Fig. 12c). The shape diagram can now be folded in half to obtain the tangle in Fig. 13a. We then apply Ernst’s algorithm by dividing it into five columns (Fig. 13b), i.e. by drawing imaginary edges that connect each upper dot to its corresponding bottom dot, and selecting for each of them the regions of odd depth (Fig. 13c). We then build the DAG by connecting two regions $r_1$ and $r_2$ if they are diagonally adjacent and $r_1$ is above $r_2$ (Fig. 13d). Each node now corresponds to a region, and each edge indicates that two regions are diagonally adjacent. We then read the graph nodes from top to bottom and from left to right. If a node is in column i, then we write in output the prime tangle $U_i$ (Fig. 13e).

Fig. 12

Example 1: RNA. a A pseudoknot free RNA secondary structure along with its primary structure and dot-bracket representation. b The flattened diagram (unpaired nucleotides are not drawn due to space constraints). c The shape diagram obtained by collapsing parallel edges onto a single one

Fig. 13

Example 1: Factorization. a The tangle obtained from the shape diagram. b The tangle divided into five columns. c The regions of odd depth are colored in gray. d The DAG obtained by Ernst’s algorithm. e The minimal factorization for the initial tangle

RNA with pseudoknots

Suppose we have a complex RNA secondary structure that yields the tangle in Fig. 14a. Since it is an $H$-tangle, for this example our algorithm applies Heuristic 2 on the smallest lower hook (in this case there is only one, namely $2' : 7'$). To choose where we should shrink this lower hook, the algorithm calculates which shrinkage location increases the size of the other edges the least (Table 2).

Fig. 14

Example 2. a The -tangle to be factorized. Circled numbers index all possible shrinkage locations for the lower hook . b Apply the Heuristic for -tangles. The dashed lines indicate the edges in the set . c Apply the Heuristic for again. d The resulting -tangle, which can be factorized optimally by using Algorithm 1

Table 2

This table calculates, for each edge inside lower hook , how much it would increase (or decrease) in size if the algorithm shrunk lower hook into shrinkage locations from 1 to 5

Shrinkage location	3:3′	7:4′	6:5′	5:6′	Sum	Factors
1	+1	-1	-1	+1	0	$T_6 \circ T_5 \circ T_4 \circ T_3$
2	+1	-1	-1	+1	0	$T_2 \circ T_6 \circ T_5 \circ T_4$
3	+1	+1	-1	+1	2	$T_2 \circ T_3 \circ T_6 \circ T_5$
4	+1	+1	+1	+1	4	$T_2 \circ T_3 \circ T_4 \circ T_6$
5	+1	+1	+1	-1	2	$T_2 \circ T_3 \circ T_4 \circ T_5$

The best shrinkage location is selected among those that have the minimal sum of these sizes (1 and 2 in this case). The rightmost column indicates which set of prime factors, when composed with the initial tangle, shrink the in the selected location

Since in this case the heuristic found two best locations, 1 and 2, it randomly chooses location number 1. Therefore will be shrunk to a lower hook and the factorization yielded so far is , the reverse of the factorization for this location (we record the reverse because if during factorization we need to shrink the lower hook, during composition we need to expand it). The algorithm now tries to factorize the tangle returned from the last step (Fig. 14b). Since it is a -tangle, the algorithm will apply Heuristic 1. It will select the lower hook and check which edges intersect with the imaginary edges in the set . The only edge intersecting both is , therefore and are merged together. This step returns the tangle in Fig. 14c and yields the prime factor . Since the tangle in Fig. 14c is still a -tangle, the same step is applied again, returning the -tangle in Fig. 14d and yielding the prime factor . This last tangle can be factorized optimally by applying Algorithm 1, which yields the factorization by performing the following steps:

This last step returns the identity tangle, therefore the algorithm stops and yields the factorization . This factorization is minimal; therefore the reduction step is not necessary.
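The selection rule behind Table 2 (sum the size changes per shrinkage location, keep the locations with the smallest sum) can be reproduced with a short sketch. The numbers are copied from Table 2; the names `SIZE_CHANGES` and `best_locations` are hypothetical and only serve this illustration.

```python
from typing import Dict, List, Tuple

# Size change of each edge inside the lower hook, per shrinkage location.
# Values copied from Table 2; column order is 3:3', 7:4', 6:5', 5:6'.
SIZE_CHANGES: Dict[int, List[int]] = {
    1: [+1, -1, -1, +1],
    2: [+1, -1, -1, +1],
    3: [+1, +1, -1, +1],
    4: [+1, +1, +1, +1],
    5: [+1, +1, +1, -1],
}


def best_locations(changes: Dict[int, List[int]]) -> Tuple[Dict[int, int], List[int]]:
    """Return the per-location sums and the locations achieving the minimum."""
    sums = {loc: sum(deltas) for loc, deltas in changes.items()}
    lowest = min(sums.values())
    return sums, sorted(loc for loc, total in sums.items() if total == lowest)


sums, candidates = best_locations(SIZE_CHANGES)
```

Here `sums` matches the Sum column of Table 2 and `candidates` contains locations 1 and 2; a tie like this one is where the text describes a random choice, which could be written as `random.choice(candidates)`.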
Reduction of a non-minimal factorization

Suppose the following non-minimal factorization term is given: . The rewriting logic step minimizes the term by performing the following rewrites using the rules presented in Table 1.

In the last step there are no applicable delete rules and no move rules that eventually lead to a delete. Therefore this factor list is minimal.
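One way to sanity-check that a rewrite leaves the tangle unchanged is to compose the factors as diagrams and compare the results. The sketch below does not implement the rules of Table 1, which are not reproduced here. It assumes the usual Brauer monoid generators (a crossing `T_i` swapping strands i and i+1, a hook `U_i` joining nodes i and i+1 on both rows), represents a tangle as a pairing of its 2n nodes, and assumes that composing `a` with `b` means stacking `a` above `b`. The names `identity`, `crossing`, `hook`, `compose` and `evaluate` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[str, int]      # ("t", i) is the i-th top node, ("b", i) the i-th bottom node
Tangle = Dict[Point, Point]  # symmetric pairing of the 2n nodes


def identity(n: int) -> Tangle:
    pairs: Tangle = {}
    for i in range(1, n + 1):
        pairs[("t", i)] = ("b", i)
        pairs[("b", i)] = ("t", i)
    return pairs


def crossing(i: int, n: int) -> Tangle:
    """T_i (assumed convention): strands i and i+1 cross."""
    t = identity(n)
    t[("t", i)], t[("b", i + 1)] = ("b", i + 1), ("t", i)
    t[("t", i + 1)], t[("b", i)] = ("b", i), ("t", i + 1)
    return t


def hook(i: int, n: int) -> Tangle:
    """U_i (assumed convention): an upper and a lower hook on nodes i, i+1."""
    t = identity(n)
    t[("t", i)], t[("t", i + 1)] = ("t", i + 1), ("t", i)
    t[("b", i)], t[("b", i + 1)] = ("b", i + 1), ("b", i)
    return t


def compose(a: Tangle, b: Tangle) -> Tangle:
    """Stack a above b, glue a's bottom row to b's top row, drop closed loops."""
    result: Tangle = {}
    outer = [p for p in a if p[0] == "t"] + [p for p in b if p[0] == "b"]
    for start in outer:
        in_a = start[0] == "t"
        here = a[start] if in_a else b[start]
        # Walk through the glued middle row until an outer node is reached.
        while here[0] == ("b" if in_a else "t"):
            in_a = not in_a
            here = (a if in_a else b)[("b" if in_a else "t", here[1])]
        result[start] = here
    return result


def evaluate(factors: List[Tangle], n: int) -> Tangle:
    """Compose a factor list from left to right, starting from the identity."""
    result = identity(n)
    for factor in factors:
        result = compose(result, factor)
    return result
```

With this representation, two factor lists denote the same tangle exactly when `evaluate` returns equal dictionaries. For instance, `evaluate([crossing(2, 6), crossing(4, 6)], 6)` equals `evaluate([crossing(4, 6), crossing(2, 6)], 6)`, reflecting that factors acting on disjoint strands commute, whereas swapping `crossing(2, 6)` and `crossing(3, 6)` gives a different tangle.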

Results

The resulting tangle is invariant to synonymous mutations, which are mutations that do not change the secondary structure. This is due to the fact that we discard unpaired nucleotides and abbreviate stacked arcs, allowing multiple secondary structures to map to the same factorization. This also allows researchers to move their attention to patterns in the factorizations of their desired shapes. A less obvious result (already observed by Kauffman and Magarshak) is that every secondary structure without pseudoknots maps to a -tangle. The intuition behind this result is that the number of valid ways we can arrange 2N open and closed parentheses of a single type is the Catalan number $C_N = \frac{1}{N+1}\binom{2N}{N}$, which is exactly the number of tangles with non-crossing edges in [4, 23]. This also implies that every pseudoknotted secondary structure corresponds to a tangle with at least one crossing, and thus at least one -prime as a factor.

Let us show some other properties using the example we provided in the previous section (Fig. 11). In the corresponding tangle, only stems and pseudoknots are visible and they are encoded in the factorization. Starting from stem , six pairs are identified with the unique vertical edge, which does not have corresponding factors. Its presence, however, causes the indexes of the prime tangles to be shifted by one (Proposition 1). The three pairs of the stem correspond to the arc generated by the factor . The stem , corresponding to the edge , is generated by . This is because its two endpoints were situated in the first and second half of the flattened secondary structure, causing it to be represented as a diagonal edge. The stem , identified with the edge 2 : 4, is generated by (note that and can commute, see “Discussion” section). Lastly, the pseudoknots are identified with edge generated by and , which are the factors in common with the edges that it crosses, 2 : 4 and (Proposition 3). We will give a mathematical foundation for these empirical results.
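The counting argument above can be checked by brute force for small N. The sketch below enumerates every way of pairing 2N points on a line, keeps the pairings without crossing arcs, and compares the count with the closed form of the Catalan number, N+1 dividing the binomial coefficient of 2N over N. The function names are hypothetical and the enumeration is only practical for small N.

```python
from math import comb
from typing import Iterator, List, Tuple

Arc = Tuple[int, int]


def catalan(n: int) -> int:
    """Closed form of the n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)


def perfect_matchings(points: List[int]) -> Iterator[List[Arc]]:
    """Yield every way of pairing up the given (sorted) points."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def is_noncrossing(matching: List[Arc]) -> bool:
    """Two arcs (a, b) and (c, d) cross when a < c < b < d."""
    return not any(a < c < b < d for a, b in matching for c, d in matching)


def count_noncrossing(n: int) -> int:
    """Number of crossing-free pairings of 2n points on a line."""
    points = list(range(2 * n))
    return sum(1 for m in perfect_matchings(points) if is_noncrossing(m))
```

For N from 1 to 5 the two counts agree, which is consistent with the statement that pseudoknot-free structures correspond to diagrams without crossings.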
Given a section s of an RNA secondary structure, stem or pseudoknot, we write to denote its corresponding edge in the RNA shape (or tangle) beginning in position i and ending in position j (with ). Given a tangle X and an edge , we will write gen(e) to indicate the factors that generate it.

Proposition 1

If an RNA secondary structure has a stem s with , then the index of every factor of the corresponding tangle will always be greater than or equal to two. The converse is also true.

Proof

Assume that an RNA shape has an edge . Let X be the corresponding tangle, then and therefore there is no prime or in the factorization of X. The backward argument is also valid.

Proposition 2

Let s be a stem of an RNA secondary structure and let p be a pseudoknot starting inside the hairpin of s and ending outside of it. Then edge(s) will cross edge(p).

Proof

We can abstract edge(s) to be a 2-dimensional closed curve by closing its two ends with a horizontal line. We then have that edge(p) starts inside of and ends outside of it. By the Jordan Curve Theorem on we know that edge(p) must cross , and since we assume that in the shape diagram all edges are situated in the upper portion of the diagram we know that edge(p) must cross edge(s).

Proposition 3

Let X be a tangle with and let . If and cross, then there exists for some i.

Proof

Since and cross, they must share a prime tangle P that generates their crossing. But since every intersection is generated by a -prime, P must be a -prime. This implies that a -prime generates both and .

Discussion

The existence of equivalent factorizations leads us to reason about an open question:

Open Question

What is the biological interpretation of commutative factors and, in general, of equivalent factorizations? We hypothesise two separate research directions, regarding equivalent factorizations up to commutativity (R13) and equivalent factorizations up to R11 and R12.

The reason for this distinction is that R13 does not really pose a challenge during factorization; recall that R13 is defined as:

The number of prime factors and remains unchanged, whereas in R11 and R12:

The number of s is two on the left side and one on the right for R11, and for R12, the left side and the right side do not even share a common factor. Since the factorizations yielded by R11 and R12 are fundamentally different, we think that they have a different biological interpretation than R13.

We can also discuss another research direction by analyzing different mappings from RNA secondary structures to tangles. For example, in the mapping we discussed in this paper, if there is a pseudoknot p connecting stems and then in the corresponding tangle there will be three edges, one for each of them. In this framework, the interaction between two stems is represented by an edge intersecting their corresponding edges. We could, instead, think of another mapping in which stems connected by a pseudoknot will have their corresponding edges that cross each other (Fig. 15).

Fig. 15

Two different mappings. Two mappings in which pseudoknots are treated differently. and are two stems and p is a pseudoknot connecting them. a The mapping that Kauffman and Magarshak proposed. b Another mapping in which the pseudoknot corresponds to the intersection between and (grey dot)

We did not explore this alternative mapping, so we leave it as a future research direction.

Regarding the factorization algorithm, there are also some improvements that can be made with respect to time complexity. Our methodology uses heuristics to obtain a non-minimal factorization and then refines it by using rewriting logic. This last step becomes prohibitive for large tangles, therefore a faster approach is necessary. During our research, we did not find an algorithm capable of such performance, but we hypothesise that the factorization problem for the Brauer Monoid could be solved in polynomial time.

We now discuss some practical applications of our methodology. The factor representation we have discussed in this paper can be useful as an additional classification criterion for RNA secondary structure databases, in which a user could query RNAs that are generated only by a particular set of prime tangles, without needing to specify the exact shape of the RNA molecule they are interested in. This could also lead to interesting applications in the context of sequence alignment, in which two sequences are compared not by the alignment of their nucleotides, but by their factor list.

As we discussed in the “Background” section, the folding problem is the focus of a large amount of research. In recent years, Machine Learning techniques have been widely used in this context, in which a model is trained to predict the optimal secondary structure from a sequence of nucleotides [25].
We imagine that a machine learning model could be trained to predict the full factorization of the optimal secondary structure so that its shape would be easily computable or, alternatively, a model capable of predicting just a subset of this factorization, thus greatly reducing the search space for the optimal structure. We have not investigated this path, so we leave it as a future research direction.

Conclusions

We have crossed the bridge that Kauffman and Magarshak have built between RNA secondary structures and the Brauer Monoid to pave the way for a novel prime tangle factorization for RNA secondary structures. Our results show that the presence of pseudoknots influences the type of factors the corresponding tangle has. Moreover, we proved that two interconnected sections of the RNA secondary structure will naturally share some factors. Since the exact interpretation of equivalent factorizations is not clear, we expect further development in this direction. In any case, the proposed approach may prove useful for reducing the search space for the optimal folding and for structure comparison and classification.

10 in total

1. The crystal structure of yeast phenylalanine tRNA at 1.93 A resolution: a classic structure revisited.

2. Abstract shapes of RNA.

3. Shapes of RNA pseudoknot structures.

4. Topological classification of RNA structures.

5. Topology and prediction of RNA pseudoknots.

6. Classification and predictions of RNA pseudoknots based on topological invariants.

7. Process calculi may reveal the equivalence lying at the heart of RNA and proteins.

8. An algebraic language for RNA pseudoknots comparison.

9. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics.