Robson da Silva Lopes¹, Walas Jhony Lopes Moraes¹, Thiago de Souza Rodrigues², Daniella Castanheira Bartholomeu³.
Abstract
Repetitive element sequences are adjacent, repeating patterns, also called motifs, and can be of different lengths; repetitions can involve their exact or approximate copies. They have been widely used as molecular markers in population biology. Given the sizes of sequenced genomes, various bioinformatics tools have been developed for the extraction of repetitive elements from DNA sequences. However, currently available tools do not combine the identification of repetitive elements in both the genome and the proteome, a user-friendly web interface, and exhaustive searches. ProGeRF is a web site for extracting repetitive regions from genome and proteome sequences. It was designed to be an efficient, fast, accurate, and above all user-friendly web tool that offers many ways to view and analyse the results. ProGeRF (Proteome and Genome Repeat Finder) is freely available as a stand-alone program, from which the users can download the source code, and as a web tool. It was developed using the hash table approach to extract perfect and imperfect repetitive regions in a (multi)FASTA file, while allowing a linear time complexity.
Year: 2015 PMID: 25811026 PMCID: PMC4355816 DOI: 10.1155/2015/394157
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: ProGeRF architecture. Structure of the tool for both the web environment and the stand-alone mode. The dark blue rectangles with rounded corners represent interfaces with the system. The transparent rectangles with a blue background represent algorithms implemented in C or Perl. The process script receives data from the web environment, processes them, saves them in a MySQL database, and calls the repetition extraction module.
Figure 2: Creating the degeneration hash table. Step 1: a sliding window maps each motif of the sequence to a position in the degeneration hash table and sets the value at that position to 1. Step 2: the possible degenerations of the sliding window are generated and stored in the buffer at position k of the sliding window; only the degenerations that map to a hash table position holding the value 1 are retained.
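The two steps in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not ProGeRF's C or Perl code. It assumes that a "degeneration" is a single-symbol substitution, uses a Python `set` in place of the integer-indexed hash table, and keeps only the degenerations that also occur as windows of the sequence. The function names and the `alphabet` parameter are hypothetical.

```python
def window_motifs(sequence, k):
    """Step 1 (sketch): record every length-k window of the sequence in a hash set."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def degenerations(motif, alphabet="ACGT"):
    """Yield every string that differs from motif by exactly one substitution.

    Assumption: one substitution per degeneration; the source does not define this.
    """
    for pos, original in enumerate(motif):
        for symbol in alphabet:
            if symbol != original:
                yield motif[:pos] + symbol + motif[pos + 1:]


def degeneration_buffer(sequence, k, alphabet="ACGT"):
    """Step 2 (sketch): for each window start, keep the degenerations found in the set."""
    seen = window_motifs(sequence, k)
    buffer = {}
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        hits = sorted(d for d in degenerations(window, alphabet) if d in seen)
        if hits:
            buffer[i] = hits
    return buffer
```

For a fixed window length and alphabet, each window generates a constant number of candidates and each set lookup takes expected constant time, so the work grows linearly with the sequence length.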
Figure 3: Creating the repetitive element hash table (REHT). Step 1: a sliding window maps each motif of the sequence to a position in the repetitive element hash table, sets the value at that position to 1, and adds the sliding window to, or removes it from, a single bucket. Step 2: check whether the current sliding window is a degeneration of a motif already recorded in the REHT buckets. Step 3: the function h() is applied to each degeneration in the buffer, the result is converted into an integer k, and Step 1 is then performed.
Comparison of the number of detections and execution times (in seconds) of Mreps, Misa, Sputnik, GMATo, SciRoKo, TRF, and ProGeRF. The tools were run on a Dell Inspiron with an Intel Core 2 Duo 2.2 GHz processor, 2 MB cache, 3 GB RAM, and a 320 GB hard drive, running Ubuntu 14.04 LTS (32-bit).
| Sequence | Mreps | Misa | Sputnik | GMATo | SciRoKo | TRF | ProGeRF |
|---|---|---|---|---|---|---|---|
| | Rep (time) | Rep (time) | Rep (time) | Rep (time) | Rep (time) | Rep (time) | Rep (time) |
| NC_004318.1 (1204 kb) | 9608 (2.8) | 22867 (3.2) | 7420 (0.7) | 23539 (10.3) | 3763 (1.1) | 30244 (72.4) | 26164 (3.9) |
| NC_001136.8 (1531 kb) | 935 (1.4) | 10640 (3.3) | 1427 (0.9) | 10721 (7.7) | 185 (0.7) | 8101 (4.5) | 11552 (2.4) |
| NC_000962.2 (4411 kb) | 1412 (3.9) | 6832 (8.9) | 3140 (1.46) | 6846 (12.1) | 72 (1.5) | 19496 (24.5) | 11422 (4.0) |
| * | — (—) | 2054241 (868.3) | 480644 (105.7) | 2073643 (9859.1) | 47770 (129.0) | 2438036 (1481.5) | 2319812 (1352.0) |
*Whole genome. The values in brackets are runtimes in seconds.
Loci and nucleotide coverage between tools.
| Sequence | Tool A | B: Mreps | B: Misa | B: Sputnik | B: GMATo | B: SciRoKo | B: TRF | B: ProGeRF |
|---|---|---|---|---|---|---|---|---|
| | Mreps | — | 78 (45) | 53 (33) | 78 (45) | 41 (36) | 98 (74) | 89 (60) |
| | Misa | 47 (63) | — | 21 (4) | 100 (99) | 18 (40) | 88 (79) | 100 (98) |
| | Sputnik | 91 (86) | 70 (74) | — | 70 (74) | 58 (69) | 0 (0) | 83 (84) |
| | GMATo | 49 (62) | 100 (99) | 21 (40) | — | 19 (40) | 89 (79) | 100 (98) |
| | SciRoKo | 96 (95) | 93 (76) | 95 (71) | 92 (76) | — | 95 (96) | 98 (91) |
| | TRF | 51 (41) | 54 (32) | 0 (0) | 54 (32) | 16 (21) | — | 68 (46) |
| | ProGeRF | 46 (56) | 86 (67) | 21 (31) | 86 (67) | 16 (32) | 87 (78) | — |
| SAC Chr4 | Mreps | — | 60 (40) | 42 (26) | 68 (40) | 18 (20) | 95 (74) | 86 (62) |
| | Misa | 7 (12) | — | 3 (7) | 100 (99) | 1 (4) | 33 (37) | 100 (99) |
| | Sputnik | 30 (35) | 29 (30) | — | 29 (30) | 12 (1) | 74 (73) | 38 (39) |
| | GMATo | 7 (12) | 100 (99) | 3 (7) | — | 1 (4) | 33 (37) | 100 (99) |
| | SciRoKo | 91 (89) | 77 (61) | 86 (59) | 77 (61) | — | 99 (99) | 94 (72) |
| | TRF | 15 (16) | 41 (26) | 13 (12) | 41 (26) | 2 (5) | — | 48 (35) |
| | ProGeRF | 8 (16) | 92 (80) | 4 (7) | 92 (80) | 1 (5) | 34 (39) | — |
| MTB H37Rv | Mreps | — | 9 (3) | 15 (7) | 9 (3) | 3 (3) | 91 (71) | 75 (58) |
| | Misa | 2 (3) | — | 1 (1) | 100 (99) | 0.5 (1) | 13 (13) | 100 (99) |
| | Sputnik | 6 (7) | 2 (2) | — | 2 (2) | 1 (1) | 66 (64) | 14 (14) |
| | GMATo | 2 (3) | 100 (99) | 1 (1) | — | 0.5 (1) | 13 (13) | 100 (99) |
| | SciRoKo | 73 (74) | 47 (36) | 63 (41) | 47 (36) | — | 100 (100) | 79 (72) |
| | TRF | 8 (7) | 4 (1) | 10 (7) | 4 (1) | 0.4 (0.5) | — | 18 (13) |
| | ProGeRF | 9 (16) | 60 (35) | 4 (4) | 60 (35) | 0.5 (1) | 29 (33) | — |
Percentage of the total number of detections (perfect and imperfect) of tool A that were also detected (i.e., covered) by tool B. The value in brackets is the percentage of nucleotides detected by A that are covered by B.
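The two coverage figures in each cell can be computed from locus coordinates, as in the hedged Python sketch below. It assumes that loci are inclusive `(start, end)` pairs and that a locus of A counts as "covered" when it overlaps any locus of B by at least one position. The source does not state the exact overlap criterion, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent inclusive intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_length(a, b):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def coverage(loci_a, loci_b):
    """Return (percent of A loci overlapped by B, percent of A nucleotides covered by B)."""
    merged_a = merge_intervals(loci_a)
    merged_b = merge_intervals(loci_b)
    # Assumption: any overlap of at least one position counts as a covered locus.
    loci_hit = sum(
        1 for a in loci_a if any(overlap_length(a, b) > 0 for b in merged_b)
    )
    total_nt = sum(end - start + 1 for start, end in merged_a)
    covered_nt = sum(overlap_length(a, b) for a in merged_a for b in merged_b)
    loci_pct = 100.0 * loci_hit / len(loci_a) if loci_a else 0.0
    nt_pct = 100.0 * covered_nt / total_nt if total_nt else 0.0
    return loci_pct, nt_pct
```

Merging intervals first keeps nucleotides from being counted twice when a tool reports overlapping loci. The measure is asymmetric, which is why the A-versus-B and B-versus-A cells differ.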
Repetitive protein elements found by the web tool ProGeRF.
| ID sequence | Locus | Motif | Rep. |
|---|---|---|---|
| | 62–97 | GASAQS | 6 |
| | 146–297 | PNAN | 38 |
| | 693–712 | KEKEE | 4 |
The parameters used were a motif size between 2 and 6, at least 4 repetitions of the motif, and zero for gaps, overlap, and degeneration.
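To make these parameters concrete, the Python sketch below searches a sequence for perfect tandem repeats with motif sizes 2 to 6 and at least 4 repetitions. It is a direct left-to-right scan, not ProGeRF's hash table method, and it does not model gaps, overlap, or degeneration. The function and parameter names are hypothetical, the test sequence is synthetic, and loci are reported as 1-based inclusive positions by assumption.

```python
def is_primitive(motif):
    """True if motif is not itself a repetition of a shorter unit (e.g. 'ACAC')."""
    return motif not in (motif + motif)[1:-1]


def find_perfect_repeats(sequence, min_motif=2, max_motif=6, min_reps=4):
    """Return (start, end, motif, repetitions) for each perfect tandem repeat.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    hits = []
    n = len(sequence)
    for k in range(min_motif, max_motif + 1):
        i = 0
        while i + k <= n:
            motif = sequence[i:i + k]
            if not is_primitive(motif):
                i += 1
                continue
            # Count consecutive exact copies of the motif starting at i.
            reps = 1
            while sequence[i + reps * k:i + (reps + 1) * k] == motif:
                reps += 1
            if reps >= min_reps:
                hits.append((i + 1, i + reps * k, motif, reps))
                i += reps * k  # skip past the repeat just reported
            else:
                i += 1
    return hits


# Synthetic example: six copies of a 6-residue motif flanked by other residues.
example = "MK" + "GASAQS" * 6 + "LL"
print(find_perfect_repeats(example))
```

Skipping non-primitive motifs keeps a run such as `ACACACAC...` from being reported a second time with the motif `ACAC`.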
Figure 4: Screenshot of the repetitive element search for circumsporozoite protein (ACO49545.1), merozoite surface protein 1 (XP_001352170.1), and merozoite surface protein 9 (AAN36363.1). (a) Visualization of results through the jqGrid plugin; clicking on a repetitive element opens the graphical view. (b) Repetitive elements are mapped and displayed graphically through JBrowse.