Fragrep: an efficient search tool for fragmented patterns in genomic sequences.

Axel Mosig, Katrin Sameith, Peter Stadler.

Abstract

Many classes of non-coding RNAs (ncRNAs; including Y RNAs, vault RNAs, RNase P RNAs, and MRP RNAs, as well as a novel class recently discovered in Dictyostelium discoideum) can be characterized by a pattern of short but well-conserved sequence elements that are separated by poorly conserved regions of sometimes highly variable lengths. Local alignment algorithms such as BLAST are therefore ill-suited for the discovery of new homologs of such ncRNAs in genomic sequences. The Fragrep tool instead implements an efficient algorithm for detecting the pattern fragments that occur in a given order. For each pattern fragment, the mismatch tolerance and bounds on the length of the intervening sequences can be specified separately. Furthermore, matches can be ranked by a statistically well-motivated scoring scheme.

Substances: RNA, Untranslated

Year: 2006; PMID: 16689703; PMCID: PMC5054030; DOI: 10.1016/S1672-0229(06)60017-X

Source DB: PubMed; Journal: Genomics Proteomics Bioinformatics; ISSN: 1672-0229; Impact factor: 7.691

Introduction

Methods for detecting non-coding RNAs (ncRNAs) in genomic sequence data have been a topic of intense research. While techniques for detecting protein-coding genes can rely on universal characteristics such as start and stop codons, the triplet code for amino acids, or ribosome binding sites, no corresponding characteristics are known for ncRNAs. Early approaches to ncRNA detection were designed for specific types of RNAs, in particular tRNAs. Other approaches were designed for detecting arbitrary ncRNA classes, typically based on conserved sequences or secondary structure elements. The RNAMotif tool allows the user to describe a search pattern consisting of conserved stems and helices, while the ERPIN tool allows the user to annotate an alignment with secondary structure information, which is then used as a search pattern. The INFERNAL tool also derives its query from a multiple alignment. In contrast to ERPIN, however, the alignment is translated into a stochastic context-free grammar, which is then used for the (quite time-demanding) task of scanning genomic sequences. In general, the computational cost of searching for RNAs increases with the complexity of the search pattern, limiting the use of such methods for genome-wide surveys. In this paper, we present Fragrep, an efficient tool that is optimized for purely sequence-based searches on a genome-wide scale. The approach implemented in Fragrep is based on an elementary way of describing search patterns, which makes a highly efficient and hence genome-wide application possible. This approach is particularly fruitful for classes of ncRNAs that contain stem or loop regions with well-conserved sequence patterns, such as Y RNAs and vault RNAs (5, 6, 7), which, however, are interrupted by non-conserved sequences of highly variable lengths. Compared to RNAMotif, Fragrep provides a statistically well-motivated ranking scheme, which relieves the user of the need to define an individual scoring scheme. 
On the other hand, Fragrep does not search for explicit secondary structure constraints. The problem of efficiently searching a large sequence database for interrupted sequence patterns is also relevant in the context of other non-coding DNA motifs, for example cis-regulatory modules. In this context, the approach investigated in Fragrep is complementary to motif discovery procedures such as BioOptimizer or Bipad: once a suitable motif has been discovered, Fragrep can be applied to scan genome databases for such patterns (or constellations of patterns). In some cases, the fragmented patterns are informative enough to distinguish clearly between true and false positives. In many other cases, Fragrep can at least act as an efficient filtering technique. The Fragrep tool can be downloaded from http://www.bioinf.uni-leipzig.de/Software/fragrep/.

Algorithm

Suppose that the ncRNA of interest contains $k$ conserved sequence fragments, denoted by $C_1, \dots, C_k$, which occur in a given order in a set of known examples. In practice, each fragment $C_i$ is obtained as the consensus sequence of a conserved block in a multiple alignment. Scanning a genome $T$ for these blocks, we expect to find a non-conserved sequence segment $X_i$ between any two consecutive fragments $C_i$ and $C_{i+1}$. Fragrep solves the problem of determining whether there are sequences $X_1, \dots, X_{k-1}$ such that the string $C_1 X_1 C_2 X_2 \dots X_{k-1} C_k$ is contained as a substring in $T$. Additionally, Fragrep can take two further aspects into account:

Gap length bounds: For each $X_i$, the user can specify upper and lower bounds on its length, denoted by $u_i$ and $\ell_i$, respectively; only matches satisfying $\ell_i \le |X_i| \le u_i$ are taken into account by Fragrep.

Mismatches: The fragment $C_i$ does not need to match the corresponding part of $T$ exactly; the user can specify a maximal number of mismatches $m_i$. Denoting by $\tilde{C}_i$ a fragment obtained from $C_i$ by at most $m_i$ arbitrary mismatches, Fragrep will report the occurrences of $\tilde{C}_1 X_1 \tilde{C}_2 X_2 \dots X_{k-1} \tilde{C}_k$ as well.

We refer to a string $C_1 X_1 C_2 X_2 \dots X_{k-1} C_k$ satisfying all these constraints as a matching subsequence of $T$. Similar features are incorporated in other tools such as RNAbob (ftp://ftp.genetics.wustl.edu/pub/eddy/software/rnabob-2.1.tar.Z), which is based on a nondeterministic finite state machine with node rewriting rules instead of the dynamic programming approach used by Fragrep. The algorithm underlying Fragrep essentially works in two steps:

Step 1: For each $i \in [1:k]$, compute a list of all occurrences of $C_i$ in $T$.

Step 2: Apply a dynamic programming algorithm to the lists computed in the first step in order to find all matching subsequences in $T$.

Performing step 1 is straightforward. As a result, we obtain for each $i$ an ordered list of indices $Y_i = (y_{i,1}, \dots, y_{i,L_i})$, with $y_{i,j}$ denoting the position of the $j$th occurrence of $C_i$ in $T$, and $L_i$ denoting the number of occurrences of $C_i$ in $T$. 
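Step 1 can be illustrated with a deliberately naive scan. The following is a minimal sketch, not Fragrep's actual implementation: the function name `find_occurrences` is hypothetical, positions are 0-based, and mismatches are counted as plain substitutions (Hamming distance) without ambiguity codes.

```python
def find_occurrences(text, fragment, max_mismatches=0):
    """Return the sorted 0-based start positions at which `fragment` matches
    `text` with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    m = len(fragment)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], fragment):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this position
        else:
            # the inner loop finished without exceeding the tolerance
            hits.append(start)
    return hits


# One occurrence list per fragment, corresponding to the lists Y_i in the text.
genome = "ACGTTCGTACG"
exact_hits = find_occurrences(genome, "ACG")
tolerant_hits = find_occurrences(genome, "ACG", max_mismatches=1)
```

This scan costs time proportional to the genome length times the fragment length; a production tool would likely use a faster string-matching technique for this step.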
Using $Y_1, \dots, Y_k$, we now set up a graph $G = (V, E)$ with the vertex set $V = \{(i,j) \mid i \in [1:k],\ j \in [1:L_i]\}$ and an edge between $(i,j)$ and $(i+1,j')$ whenever the corresponding occurrences of $C_i$ and $C_{i+1}$ satisfy the upper and lower bounds for the gap in between. Any path of length $k-1$ in $G$ then corresponds to a valid occurrence of $C_1, \dots, C_k$ in $T$. For each $(i,j)$ with $i \in [1:k]$ and $j \in [1:L_i]$, we now compute

$$M_{i,j} = \begin{cases} 1 & \text{if } G \text{ contains a path of length } i-1 \text{ ending in } (i,j), \\ 0 & \text{otherwise.} \end{cases}$$

We have a valid occurrence of $C_1, \dots, C_k$ in $T$ whenever $M_{k,j} = 1$ for some $j \in [1:L_k]$. Furthermore, we have $M_{1,j} = 1$ for all $j \in [1:L_1]$ (since every occurrence of $C_1$ is a valid occurrence up to the first fragment). Using the graph $G$, we can compute all $M_{i,j}$ for $i > 1$ as

$$M_{i,j} = \begin{cases} 1 & \text{if there is a } j' \in [1:L_{i-1}] \text{ with } M_{i-1,j'} = 1 \text{ and } ((i-1,j'),(i,j)) \in E, \\ 0 & \text{otherwise.} \end{cases}$$

Starting with $i = 1$, the values $M_{i,j}$ can be computed row by row using dynamic programming. Each non-zero entry in the $k$th row of $M$ indicates at least one valid match. Defining $L := \max_i L_i$, altogether at most $k \cdot L$ matrix entries are computed, so that the overall time complexity for computing $M$ is $O(kL)$; enumerating all $\mu$ matching subsequences of $T$ can be done in $O(kL + \mu)$ time. The number of matches $\mu$ grows with the tolerance allowed by the gap length bounds, that is, with $u_i - \ell_i$. While in principle $\mu$ is bounded by $O(L^2)$ (since each occurrence of $C_1$ might yield one matching subsequence for each occurrence of $C_k$), one is naturally interested in queries that produce a few significant matches rather than an abundance of insignificant ones. Hence, for all practical purposes, $O(kL)$ should be seen as the dominating term in the running time.

Note that the above procedure can easily be adapted to start the dynamic programming with the most informative fragment $C_a$ rather than $C_1$ by starting in the $a$th row of $M$. This increases the search efficiency and in practice leads to a significant speedup, in particular when short or ambiguous fragments are part of the pattern. The C++ implementation of Fragrep has been optimized in this and several other algorithmic details to improve the runtime. 
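The second step can be sketched as follows. This is an illustrative Python version, not the C++ code of Fragrep: the function names are hypothetical, indices are 0-based, `gap_bounds[i]` holds the (lower, upper) bounds for the gap between fragments `i` and `i + 1`, and each row is filled by a simple quadratic loop rather than the linear-time sweep over the sorted lists that the stated $O(kL)$ bound would require.

```python
def chain_occurrences(occurrences, lengths, gap_bounds):
    """Fill the Boolean matrix M: M[i][j] is True when the j-th occurrence of
    fragment i ends a valid partial match of fragments 0..i."""
    M = [[True] * len(occurrences[0])]  # every hit of the first fragment is valid
    for i in range(1, len(occurrences)):
        lo, hi = gap_bounds[i - 1]
        row = []
        for y in occurrences[i]:
            # gap = start of this hit minus end of the previous fragment's hit
            ok = any(
                M[i - 1][j] and lo <= y - (y_prev + lengths[i - 1]) <= hi
                for j, y_prev in enumerate(occurrences[i - 1])
            )
            row.append(ok)
        M.append(row)
    return M


def enumerate_matches(occurrences, lengths, gap_bounds, M):
    """Yield one tuple of start positions per matching subsequence."""
    def extend(i, y):
        if i == 0:
            yield (y,)
            return
        lo, hi = gap_bounds[i - 1]
        for j, y_prev in enumerate(occurrences[i - 1]):
            if M[i - 1][j] and lo <= y - (y_prev + lengths[i - 1]) <= hi:
                for prefix in extend(i - 1, y_prev):
                    yield prefix + (y,)

    last = len(occurrences) - 1
    for j, y in enumerate(occurrences[last]):
        if M[last][j]:
            yield from extend(last, y)


occ = [[0], [5, 6], [12]]
lens = [3, 2, 2]
bounds = [(0, 5), (3, 5)]
M = chain_occurrences(occ, lens, bounds)
matches = list(enumerate_matches(occ, lens, bounds, M))
```

Backtracking only through entries of `M` that are set avoids exploring partial chains that cannot be extended from the first fragment.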
The current implementation only searches for gap-free occurrences of the fragments $C_i$. This limitation could be relaxed by employing a different pattern matching algorithm that allows gaps according to a prescribed gap function. The use of the exact Smith-Waterman local sequence alignment algorithm for this purpose is possible but computationally quite expensive. In our applications, gaps were restricted to specific positions. In this case, it is more efficient to break the search pattern into smaller units linked by short intervals of variable lengths. Since the identification of these breaks is self-evident in the applications under our consideration, one obtains significantly more specific query patterns than for gapped alignment matches.

A major issue in evaluating the quality of the matches produced by the above procedure is to assess how surprising a given match is, that is, how likely it is to be observed in a random sequence. To this end, Fragrep provides p- and E-value-like ranking schemes that are computed from a dinucleotide-based Markov model. In order to adapt the Markov model to the occurrences of a fragmented rather than a contiguous sequence pattern, we let $E_k$ denote the event of $C_1, \dots, C_k$ being observed at least once in a sequence of length $N$ in the given order and satisfying the distance constraints given by the respective upper and lower bounds $u_i$ and $\ell_i$ for each $C_i$ (formally dealing with a probability space over all sequences of length $N$). Furthermore, let $q(j, L)$ denote the probability of observing $C_j$ at least once in a sequence of length $L$. Our main interest is to determine the probability $p(E_k)$. We start by computing the probability $p(E_1)$. Denoting by $M$ the first-order Markov model resulting from the dinucleotide frequency distribution in $T$, we may compute $P(C_1) := P(C_1 \mid M)$ as the probability of the fragment $C_1$ being produced by $M$. 
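One plausible reading of the dinucleotide-based model is a first-order Markov chain whose transition probabilities are estimated from dinucleotide counts. The sketch below follows that reading; the function names are hypothetical, the estimation details (no pseudocounts, start probabilities from single-base frequencies) are assumptions rather than Fragrep's documented behavior, and fragments are assumed to consist of the plain letters A, C, G, T.

```python
from collections import Counter


def train_markov(text, alphabet="ACGT"):
    """Estimate start and transition probabilities of a first-order Markov
    model from the mono- and dinucleotide counts of `text`."""
    mono = Counter(text)
    di = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(mono[a] for a in alphabet)
    start = {a: mono[a] / total for a in alphabet}
    trans = {}
    for a in alphabet:
        row_total = sum(di[a + b] for b in alphabet)
        trans[a] = {
            b: (di[a + b] / row_total if row_total else 0.0) for b in alphabet
        }
    return start, trans


def fragment_probability(fragment, start, trans):
    """Probability that the model emits `fragment` at a given position."""
    p = start[fragment[0]]
    for a, b in zip(fragment, fragment[1:]):
        p *= trans[a][b]
    return p


start, trans = train_markov("ACGTACGT")
p_acg = fragment_probability("ACG", start, trans)
p_ag = fragment_probability("AG", start, trans)
```

In the toy sequence every A is followed by C and every C by G, so the fragment ACG inherits only the start probability of A, while AG never occurs and receives probability zero.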
In order to obtain $p(E_1)$, we assume that the probability of $C_1$ being produced at a position $x$ in a sequence of length $N$ is independent of the probability of $C_1$ being produced at any other position $y$. Note that this assumption holds for fragments that do not contain any substring of length 2 more than twice and, for all practical purposes, is a sufficiently good approximation for fragments that are short and contain only a few repetitive substrings. Under this assumption of independence, and since $C_1$ can start at $N - |C_1| + 1$ positions, we obtain

$$p(E_1) = 1 - \bigl(1 - P(C_1)\bigr)^{N - |C_1| + 1}.$$

The probabilities $q(j, L)$ can be computed analogously for arbitrary $j$ and $L$. It is now easy to see that for $j \in \{2, \dots, k\}$ we have

$$p(E_j) = p(E_{j-1}) \cdot q(j,\ u_j - \ell_j + |C_j|)$$

(since, in a sense, $C_j$ needs to be generated by $M$ in a sequence of length $u_j - \ell_j + |C_j|$, with $u_j$ and $\ell_j$ taken as the bounds on the gap preceding $C_j$), so that we finally obtain

$$p(E_k) = p(E_1) \cdot \prod_{j=2}^{k} q(j,\ u_j - \ell_j + |C_j|).$$

As described above, $p(E_k)$ is the probability of observing at least one exact match of the given fragmented pattern in a sequence of length $N$. This value can easily be adapted to the scenario involving occurrences with a certain number of mismatches by modifying the probabilities $P(C_j)$ accordingly. Since different matches obtained by Fragrep generally have different mismatches in different positions, we can also compute the analogous probabilities for the individual matches detected by Fragrep. Finally, $w(E_k) := -\log p(E_k)$ provides a convenient and statistically well-motivated ranking scheme for the matches.
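The combination of the per-fragment probabilities can be sketched numerically as follows. This is an illustration under stated assumptions, not Fragrep's scoring code: the function names are hypothetical, the number of candidate start positions in a window of length `n` is taken to be `n - |C| + 1`, and `gap_bounds[j - 1]` is assumed to hold the (lower, upper) bounds of the gap preceding fragment `j`.

```python
import math


def prob_at_least_once(p_fragment, seq_len, frag_len):
    """Probability of at least one occurrence among the seq_len - frag_len + 1
    start positions, treating the positions as independent."""
    positions = max(seq_len - frag_len + 1, 0)
    return 1.0 - (1.0 - p_fragment) ** positions


def pattern_probability(frag_probs, frag_lens, gap_bounds, n):
    """Approximate probability of seeing the whole fragmented pattern at
    least once in a random sequence of length n."""
    p = prob_at_least_once(frag_probs[0], n, frag_lens[0])
    for j in range(1, len(frag_probs)):
        lo, hi = gap_bounds[j - 1]
        window = hi - lo + frag_lens[j]  # region in which fragment j must fall
        p *= prob_at_least_once(frag_probs[j], window, frag_lens[j])
    return p


def score(p):
    """Ranking weight: larger values mean more surprising matches."""
    return -math.log(p)


p_total = pattern_probability([0.5, 0.25], [1, 1], [(0, 1)], n=2)
weight = score(p_total)
```

With these toy numbers the first factor is 1 - 0.5^2 = 0.75 and the second is 1 - 0.75^2 = 0.4375, so the product is 0.328125.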

Results and Discussion

We used Fragrep to study the evolution of a class of ncRNAs in the slime mold Dictyostelium discoideum that was discovered in an experimental survey by Aspegren et al. We searched the genomic sequence for the type-I ncRNAs using a simple pattern of two fragments (the first of the two query listings that follow this section). In this listing, the first two columns contain the minimal and maximal distance to the preceding pattern fragment (always 0 for the first fragment, of course), and the last column is the maximal number of mismatches that is tolerated in each fragment. The gap length in the sequences derived from the study by Aspegren et al. ranges between 58 and 88 nucleotides, so that 120 is a reasonable choice for the upper bound of the gap length. We recovered 45 candidates, of which 34 were sufficiently similar to the experimentally determined sequences to be alignable. The other 11, very divergent sequences were not included in the further analysis. A neighbor-joining tree summarizing both the known sequences and the novel candidates detected by Fragrep is displayed in Figure 1. We find that the type-I ncRNAs are located in small clusters on all six chromosomes. Interestingly, there are two subclasses, denoted by A and B, that alternate in the larger clusters, even though their directions on the chromosomes do not seem to follow a simple rule.
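The four-column rows described above can be read into a small data structure as sketched below. The names `Fragment` and `parse_pattern` are hypothetical, and the parser only handles the whitespace-separated rows as they are printed in this article; it is not a reader for Fragrep's actual input file format, which may contain additional fields.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    min_gap: int         # minimal distance to the preceding fragment
    max_gap: int         # maximal distance to the preceding fragment
    sequence: str        # fragment sequence, possibly with ambiguity codes
    max_mismatches: int  # tolerated mismatches within this fragment


def parse_pattern(text):
    """Parse rows of the form '<min> <max> <sequence> <mismatches>'."""
    fragments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        lo, hi, seq, mm = fields
        fragments.append(Fragment(int(lo), int(hi), seq.upper(), int(mm)))
    return fragments


# The type-I ncRNA query as listed with the article.
TYPE_I_QUERY = """
0 0   GTTGRCCTTACAGCAA 2
0 120 GTCAACTG         2
"""

type_i = parse_pattern(TYPE_I_QUERY)
```

Keeping the gap bounds on the fragment that follows the gap mirrors the layout of the listing, where the first fragment always carries the bounds 0 and 0.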

Fig. 1

The type-I ncRNAs from Dictyostelium discoideum. Top Left: The phylogenetic tree (neighbor-joining method) suggests that there are two major subgroups, labeled A and B. Leaf labels refer to the positions of the corresponding occurrences within the genome; for instance, X4a-5 refers to the fifth member within cluster a on Chromosome 4 (see the middle part of the figure); plus (+) or minus (−) indicates occurrences in the 5′ or 3′ direction. Top Right: The type-I ncRNAs appear in clusters on all chromosomes. The clusters are labeled by lower case letters, and the italic numbers below the clusters indicate the DdR- numbers of the expressed RNAs from the experimental survey by Aspegren et al. Bottom: The organization of the two largest clusters, a and b, located on Chromosome 4. Note that type A and type B copies alternate. The other type-I ncRNA clusters consist of no more than three sequences.

In order to evaluate the performance of the algorithm underlying Fragrep, we used a query derived from the vault RNA A-, B1-, and B2-box consensus structures in Kickhoefer et al. to scan the whole human genome. The query consisted of three fragments, each of which was 11 nucleotides long. Scanning all chromosomes of the human genome took less than 10 minutes on a standard desktop computer with a 2 GHz processor and 1 GB of main memory. Further results from scanning the human as well as the mouse, rat, and dog genomes are listed in Table 1.

Table 1

Surveys of Mammalian Genomes for vault RNA Candidates

Genome	Size (Mb)	Runtime (mm:ss)	No. of matches
Homo sapiens	2,980	9:24	14
Mus musculus	2,561	7:36	35
Rattus norvegicus	2,640	8:33	44
Canis familiaris	2,454	7:55	768

In order to estimate the influence of a high gap-length tolerance on the running time, we modified the above query to allow for gap lengths of up to 20,000, which increased the running time less than fivefold. Note that this increase in fault tolerance also increased the number of matching subsequences to the order of hundreds of thousands, so that the significance of such highly fault-tolerant queries is already limited from a practical point of view. In spite of this undue fault tolerance, the running time remains within acceptable bounds. These examples demonstrate that Fragrep can be used for systematic surveys of eukaryotic genomes. The application of standard multiple alignment tools such as ClustalW or Dialign to a relatively small set of representatives of an ncRNA class can be used to determine conserved sequence patterns, which can be turned into Fragrep queries in a straightforward manner. The Fragrep tool can then be employed to find additional members of the ncRNA family in related genomes. This approach yields significant matches where other sequence search tools such as BLAST fail to report useful results, while structure-based approaches such as INFERNAL are too costly. Naturally, Fragrep is not limited to ncRNA detection; the search for specific constellations of transcription factor binding sites is another potential application. Furthermore, the approach could easily be adapted to searching for peptide motifs in protein databases.
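How a multiple alignment might be turned into a query can be sketched as follows. This is a simplified illustration, not a procedure taken from the article: the function names and the `min_len` threshold are hypothetical, only perfectly conserved gap-free columns are treated as fragment material, and the alignment is assumed to be a list of equal-length strings with `-` as the gap character.

```python
def conserved_blocks(alignment, min_len=4):
    """Return (start_col, end_col, consensus) for each maximal run of columns
    that is identical and gap-free in all rows and at least min_len long."""
    ncols = len(alignment[0])
    blocks, start = [], None
    for col in range(ncols + 1):  # one extra step closes a block at the end
        conserved = (
            col < ncols
            and alignment[0][col] != "-"
            and all(row[col] == alignment[0][col] for row in alignment)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, col, alignment[0][start:col]))
            start = None
    return blocks


def gap_bounds(alignment, blocks):
    """Observed (min, max) ungapped lengths between consecutive blocks."""
    bounds = []
    for (_, end, _), (nxt, _, _) in zip(blocks, blocks[1:]):
        lengths = [len(row[end:nxt].replace("-", "")) for row in alignment]
        bounds.append((min(lengths), max(lengths)))
    return bounds


alignment = [
    "ACGTAA--GGCC",
    "ACGTTTTTGGCC",
    "ACGTA-C-GGCC",
]
blocks = conserved_blocks(alignment)
bounds = gap_bounds(alignment, blocks)
```

In practice one would probably widen the observed gap bounds somewhat, as the article does when it chooses an upper bound of 120 for gaps observed between 58 and 88 nucleotides.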

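Query pattern for the type-I ncRNAs of Dictyostelium discoideum (columns: minimal distance, maximal distance, fragment, maximal number of mismatches):

0	0	GTTGRCCTTACAGCAA	2
0	120	GTCAACTG	2

Query pattern for the vault RNA survey (three fragments of 11 nucleotides each):

0	TRGCNNAGYGG	1
100	GGTTCGANTCC	1
100	GGTTCGANTCC	1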
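The fragments in these listings contain IUPAC ambiguity codes such as R, Y, and N. A mismatch count that respects these codes could look like the sketch below; the table is the standard IUPAC nucleotide code, while the function name `count_mismatches` is hypothetical and the assumption that Fragrep interprets ambiguity codes in exactly this way is not confirmed by the article. Sequences are assumed to be upper case.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one covers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(window, fragment):
    """Count positions where the genomic base is not covered by the
    corresponding (possibly ambiguous) pattern symbol."""
    return sum(base not in IUPAC[code] for base, code in zip(window, fragment))


a_box = "TRGCNNAGYGG"
perfect = count_mismatches("TAGCTTAGCGG", a_box)
two_off = count_mismatches("TCGCAAAGAGG", a_box)
```

An occurrence would then be accepted when the count does not exceed the mismatch tolerance given in the last column of the listing.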

References

1. RNAMotif, an RNA secondary structure definition and search algorithm.

Authors: T J Macke; D J Ecker; R R Gutell; D Gautheret; D A Case; R Sampath
Journal: Nucleic Acids Res Date: 2001-11-15 Impact factor: 16.971

2. Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles.

Authors: D Gautheret; A Lambert
Journal: J Mol Biol Date: 2001-11-09 Impact factor: 5.469

3. BioOptimizer: a Bayesian scoring function approach to motif discovery.

Authors: Shane T Jensen; Jun S Liu
Journal: Bioinformatics Date: 2004-02-12 Impact factor: 6.937

4. A discriminative model for identifying spatial cis-regulatory modules.

Authors: Eran Segal; Roded Sharan
Journal: J Comput Biol Date: 2005 Jul-Aug Impact factor: 1.479

5. Conserved features of Y RNAs revealed by automated phylogenetic secondary structure analysis.

Authors: A D Farris; G Koelsch; G J Pruijn; W J van Venrooij; J B Harley
Journal: Nucleic Acids Res Date: 1999-02-15 Impact factor: 16.971

6. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.

Authors: T M Lowe; S R Eddy
Journal: Nucleic Acids Res Date: 1997-03-01 Impact factor: 16.971

7. Conserved features of Y RNAs: a comparison of experimentally derived secondary structures.

Authors: S W Teunissen; M J Kruithof; A D Farris; J B Harley; W J Venrooij; G J Pruijn
Journal: Nucleic Acids Res Date: 2000-01-15 Impact factor: 16.971

8. dictyBase: a new Dictyostelium discoideum genome database.

Authors: Lisa Kreppel; Petra Fey; Pascale Gaudet; Eric Just; Warren A Kibbe; Rex L Chisholm; Alan R Kimmel
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. Identification of conserved vault RNA expression elements and a non-expressed mouse vault RNA gene.

Authors: Valerie A Kickhoefer; Nil Emre; Andrew G Stephen; Michael J Poderycki; Leonard H Rome
Journal: Gene Date: 2003-05-08 Impact factor: 3.688

10. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure.

Authors: Sean R Eddy
Journal: BMC Bioinformatics Date: 2002-07-02 Impact factor: 3.169
