Surya Saha, Susan Bridges, Zenaida V Magbanua, Daniel G Peterson.
Abstract
Identification of dispersed repetitive elements can be difficult, especially when elements share little or no homology with previously described repeats. Consequently, a growing number of computational tools have been designed to identify repetitive elements in an ab initio manner, i.e. without using prior sequence data. Here we present the results of side-by-side evaluations of six of the most widely used ab initio repeat finding programs. Using sequence from rice chromosome 12, tools were compared with regard to time requirements, ability to find known repeats, utility in identifying potential novel repeats, number and types of repeat elements recognized and compactness of family descriptions. The study reveals profound differences in the utility of the tools with some identifying virtually their entire substrate as repetitive, others making reasonable estimates of repetition, and some missing almost all repeats. Of note, even when tools recognized similar numbers of repeats they often showed marked differences in the nature and number of repeat families identified. Within the context of this comparative study, ReAS and RepeatScout showed the most promise in analysis of sequence reads and assembled genomic regions, respectively. Our results should help biologists identify the program(s), if any, that is best suited for their needs.
Year: 2008 PMID: 18287116 PMCID: PMC2367713 DOI: 10.1093/nar/gkn064
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Comparison of ab initio repeat finders
| Tool | Run time (mins) | Portion (% bp) of dataset classified as repetitive | Potential novel repeats (% bp) | | Number of families | Average length of family consensus sequence | Total bases in consensus sequence library |
|---|---|---|---|---|---|---|---|
| Recon | 3866.17 | 38.1 | 25.8 | 15.7 | 1291 | 485 | 626 157 |
| ReAS | 7.14 | 50.1 | 15.3 | 2.1 | 181 | 443 | 80 235 |
| RepeatGluer | 4.55 | 99.9 | 99.3 | 72.2 | 38 | 157 451 | 5 983 148 |
| RepeatScout | 4.76 | 26.2 | 7.3 | 0.3 | 45 | 302 | 13 614 |
| RepeatFinder | 0.38 | 32.7 | 10.3 | 1.4 | 1000 | 153 | 153 531 |
| PILER | 0.11 | 0.04 | <0.1 | 0.0 | 1 | 420 | 420 |
| RepeatScout | 112.11 | 84.3 | 39.7 | 4.4 | 657 | 860 | 565 020 |
| RepeatFinder | 59.48 | 85.3 | 44.6 | 8.9 | 25 113 | 252 | 6 347 059 |
| PILER | 3.45 | 38.3 | 17.5 | 1.4 | 41 | 2767 | 113 473 |
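The last three columns of the table are arithmetically linked: the average consensus length is the total number of library bases divided by the number of families (for Recon, 1291 × 485 ≈ 626 157, allowing for rounding of the average). A minimal sketch of how such summary figures could be derived is given here, assuming each family is represented by exactly one consensus sequence; the function name `summarize_library` and the toy sequences are hypothetical and do not come from the study.

```python
def summarize_library(consensus_seqs):
    """Return (number of families, average consensus length, total bases).

    Assumes one consensus sequence per repeat family (an assumption made
    for this sketch, not a statement about any particular tool's output).
    """
    lengths = [len(seq) for seq in consensus_seqs.values()]
    n_families = len(lengths)
    total_bases = sum(lengths)
    avg_length = total_bases / n_families if n_families else 0.0
    return n_families, avg_length, total_bases


# Hypothetical toy library: family name -> consensus sequence.
toy_library = {
    "family_1": "ACGTACGTAC",       # 10 bp
    "family_2": "GGGTTTAAACCCGGG",  # 15 bp
    "family_3": "ATATA",            # 5 bp
}

n_families, avg_length, total_bases = summarize_library(toy_library)
print(n_families, avg_length, total_bases)
```

A library with few, long consensus sequences and one with many short ones can have similar totals, so the three figures are best read together.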
Interlibrary Intersection values (%) for the different combinations of tools
Each value denotes the percentage of bases in the query library (consensus sequence set) that are masked by RepeatMasker using the reference library.
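The intersection value described above can be sketched as a simple base count. The example assumes the query library has already been run through RepeatMasker with the reference library, that masked bases appear as `N` (RepeatMasker's default hard-masking character), and that both the original and masked sequences are held as dictionaries keyed by sequence name. The function name `percent_masked` and the toy sequences are hypothetical.

```python
def percent_masked(original_seqs, masked_seqs, mask_char="N"):
    """Percentage of query-library bases masked by the reference library.

    Bases that were already `mask_char` in the original sequence are not
    counted as newly masked.
    """
    total = 0
    masked = 0
    for name, original in original_seqs.items():
        masked_seq = masked_seqs[name]
        if len(masked_seq) != len(original):
            raise ValueError(f"length mismatch for sequence {name!r}")
        total += len(original)
        masked += sum(
            1
            for orig_base, new_base in zip(original, masked_seq)
            if new_base.upper() == mask_char and orig_base.upper() != mask_char
        )
    return 100.0 * masked / total if total else 0.0


# Hypothetical query library before and after masking.
query_library = {"q1": "ACGTACGTAC", "q2": "TTTTGGGGCC"}
masked_library = {"q1": "ACNNNNGTAC", "q2": "TTTTGGGGCC"}

value = percent_masked(query_library, masked_library)
print(value)
```

The measure is not symmetric: swapping the query and reference libraries generally gives a different percentage, which is why values are reported for each ordered pair of tools.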
Figure 1. Sensitivity of Recon and ReAS in detecting known classes of repeats as previously identified by RepeatMasker with Repbase (RMRB). Simulated unassembled sequence reads were used as the initial substrate for the analyses. Neither ReAS nor Recon detected any ‘small RNAs’ and thus this repeat category is not shown.
Figure 2. Sensitivity of RepeatScout, RepeatFinder and PILER for detecting known classes of repeats as previously identified by RepeatMasker with Repbase (RMRB). The entire chromosome 12 sequence was used as the substrate for the analyses.
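The captions for Figures 1 and 2 do not spell out how sensitivity is computed. One plausible reading, assumed here, is the percentage of bases annotated by RMRB in a given repeat class that are also marked as repetitive by the ab initio tool. The sketch below works on half-open `(start, end)` coordinate intervals; the function names, class labels and coordinates are all hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_bases(a, b):
    """Count the bases covered by both interval sets."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def sensitivity_by_class(known, predicted):
    """Percent of known bases per repeat class recovered by the tool."""
    result = {}
    for repeat_class, intervals in known.items():
        known_bases = sum(end - start for start, end in merge_intervals(intervals))
        found = overlap_bases(intervals, predicted)
        result[repeat_class] = 100.0 * found / known_bases if known_bases else 0.0
    return result


# Hypothetical RMRB annotations per class and hypothetical tool predictions.
known_repeats = {"LTR": [(0, 100), (200, 300)], "DNA": [(500, 600)]}
tool_predictions = [(50, 250), (550, 575)]

sensitivity = sensitivity_by_class(known_repeats, tool_predictions)
print(sensitivity)
```

Under this definition, a tool that labels nearly the whole substrate as repetitive would score close to 100% in every class, so sensitivity is only informative alongside the portion of the dataset classified as repetitive.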
Figure 3. Composition of repeat libraries. (A) Results for tools using chromosome 12 simulated reads (18-Mb dataset) as a substrate. (B) Results for tools using the intact chromosome 12 sequence as substrate.