Loredana M Genovese1, Marco M Mosca2, Marco Pellegrini1,3, Filippo Geraci1.
Abstract
MOTIVATION: Large-scale sequencing projects have confirmed the hypothesis that eukaryotic DNA is rich in repetitions whose functional role needs to be elucidated. In particular, tandem repeats (TRs), i.e. short, almost identical sequences that lie adjacent to each other, have been associated with many cellular processes and are also involved in several genetic disorders. The need for comprehensive lists of TRs for association studies and the absence of a computational model able to capture their variability have revived research on discovery algorithms.
Year: 2019 PMID: 30165507 PMCID: PMC6419916 DOI: 10.1093/bioinformatics/bty747
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Sample data structure for the matrix associated with the sequence TTACGACGTACGATGACGACGT
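The figure itself is not reproduced here. As a rough illustration only, the sketch below assumes that the matrix in the caption is a self-comparison dot matrix (as the tool name Dot2dot suggests), in which cell (i, j) is marked when the characters at positions i and j are equal. Runs of marked cells along a diagonal at distance p from the main diagonal then indicate a candidate repeat of period p. The function names `dot_matrix` and `diagonal_runs` and the `min_len` parameter are hypothetical, and this is not the authors' data structure or algorithm.

```python
def dot_matrix(seq):
    """Return the set of cells (i, j), i < j, where seq[i] == seq[j].

    Only the upper triangle is stored because the matrix is symmetric.
    """
    n = len(seq)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if seq[i] == seq[j]}


def diagonal_runs(seq, min_len=3):
    """List (start, period, length) for exact-match runs on each diagonal.

    A run of `length` matches on the diagonal at offset `period` means that
    seq[start:start + length] equals seq[start + period:start + period + length].
    Mismatches are not tolerated in this simplified sketch.
    """
    n = len(seq)
    runs = []
    for period in range(1, n):
        i = 0
        while i + period < n:
            length = 0
            while (i + length + period < n
                   and seq[i + length] == seq[i + length + period]):
                length += 1
            if length >= min_len:
                runs.append((i, period, length))
            # Skip past the run (or one position when there was no match).
            i += max(length, 1)
    return runs


if __name__ == "__main__":
    sequence = "TTACGACGTACGATGACGACGT"
    for start, period, length in diagonal_runs(sequence):
        print(start, period, length, sequence[start:start + length + period])
```

Real TRs are only approximately identical, so a practical detector would also have to tolerate substitutions and indels along the diagonals, which this sketch deliberately leaves out.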
Average precision σ, average recall σ, average number of reported results covering a target locus (RPL) and total number of TRs intersected (with Jaccard=0.5 and Jaccard=0.7) of the compared algorithms and datasets
| Dataset | Measure | Dot2dot-filter | Dot2dot | TRF | MREPS | TRStalker | TRStalker* | SWAN | Troll | SciRoKo | Repeatmasker |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Diseases | 0.678 | 0.722 | 0.630 | 0.464 | 0.491 | 0.372 | 0.508 | 0.724 | 0.452 | ||
| 0.835 | 0.899 | 0.763 | 0.686 | 0.901 | 0.575 | 0.571 | 0.752 | 0.475 | |||
| RPL | 6.0 | 1.4 | 1.5 | 155.9 | 157.5 | 2.0 | 4.1 | 1.1 | 1.2 | ||
| #TR j=0.5 | 44 | 37 | 37 | 43 | 27 | 29 | 38 | 24 | |||
| #TR j=0.7 | 36 | 42 | 33 | 23 | 43 | 24 | 18 | 33 | 20 | ||
| CODIS | 0.721 | 0.836 | 0.659 | 0.485 | 0.565 | 0 | 0.682 | 0.819 | 0.812 | ||
| 0.860 | 0.899 | 0.850 | 0.818 | 0.839 | 0 | 0.797 | 0.867 | 0.822 | |||
| RPL | 6.3 | 1.1 | 2.1 | 188.2 | 172.2 | 0.0 | 3.2 | 1.2 | |||
| #TR j=0.5 | 19 | 19 | 18 | 0 | 19 | 19 | |||||
| #TR j=0.7 | 18 | 18 | 15 | 15 | 18 | 0 | 16 | 16 | 15 | ||
| Y-STR | 0.706 | 0.770 | 0.617 | — | 0.535 | 0.021 | 0.682 | 0.827 | 0.721 | ||
| 0.879 | 0.916 | 0.786 | 0.712 | — | 0.029 | 0.777 | 0.841 | 0.729 | |||
| RPL | 6.1 | 1.1 | 2.1 | — | 205.5 | 2.5 | 2.8 | 1.1 | |||
| #TR j=0.5 | 84 | 70 | 74 | — | 3 | 79 | 80 | 67 | |||
| #TR j=0.7 | 74 | 81 | 65 | 59 | — | 1 | 75 | 69 | 62 | ||
| Marshfield | 0.637 | 0.784 | 0.662 | — | 0.559 | 0 | 0.683 | 0.810 | 0.767 | ||
| 0.860 | 0.905 | 0.794 | 0.755 | — | 0 | 0.792 | 0.830 | 0.775 | |||
| RPL | 7.1 | 1.1 | 2.0 | — | 154.3 | 0.0 | 3.2 | 1.1 | |||
| #TR j=0.5 | 589 | 609 | 559 | 577 | — | 0 | 583 | 587 | 554 | ||
| #TR j=0.7 | 503 | 557 | 447 | 405 | — | 0 | 490 | 471 | 437 | ||
| Promoters | 0.667 | 0.663 | 0.575 | 0.422 | 0.422 | 0.663 | 0.517 | 0.808 | 0.743 | ||
| 0.803 | 0.548 | 0.687 | 0.929 | 0.929 | 0.447 | 0.627 | 0.709 | 0.432 | |||
| RPL | 1.0 | 4.8 | 1.4 | 1.9 | 116.0 | 116.0 | 2.2 | 2.1 | 1.0 | ||
| #TR j=0.5 | 13 783 | 15 192 | 8864 | 12 490 | 15 264 | 5156 | 10 939 | 12 125 | 7128 | ||
| #TR j=0.7 | 12 329 | 14 792 | 7905 | 7880 | 15 098 | 3233 | 5723 | 10 492 | 6286 | ||
Note: TRStalker run on the trimmed sequences is reported as TRStalker*. The best value is highlighted in bold.
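The "#TR j=0.5" and "#TR j=0.7" rows count target loci that are intersected by a reported TR at a given Jaccard threshold. The sketch below shows one way such a count could be computed, assuming that both targets and predictions are half-open genomic intervals `(start, end)` on the same sequence and that the Jaccard coefficient is taken over base positions. The exact matching rule used in the evaluation is not stated in this record, and the names `interval_jaccard` and `count_intersected` are hypothetical.

```python
def interval_jaccard(a, b):
    """Jaccard coefficient of two half-open intervals given as (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def count_intersected(targets, predictions, threshold):
    """Count targets matched by at least one prediction at the threshold."""
    return sum(
        1
        for target in targets
        if any(interval_jaccard(target, p) >= threshold for p in predictions)
    )


if __name__ == "__main__":
    # Toy coordinates, not taken from any of the datasets in the table.
    targets = [(0, 10), (100, 120)]
    predictions = [(2, 10), (100, 135)]
    print(count_intersected(targets, predictions, 0.5))
    print(count_intersected(targets, predictions, 0.7))
```

Raising the threshold from 0.5 to 0.7 can only lower the count, which is consistent with the pattern in the table rows.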
Running time of the compared algorithms over seven common reference genomes
| Organism name | Size (Mb) | Dot-to-dot filter | TRF | Mreps | Tandem SWAN | SciRoKo | Troll |
|---|---|---|---|---|---|---|---|
| Pinus lambertiana | 27602.70 | 12 h 15 min 18 s | 16 h 15 min 14 s | N/A | N/A | 2 h 13 min 51 s | N/A |
| Triticum aestivum | 13427.40 | 5 h 57 min 14 s | 11 h 13 min 43 s | N/A | N/A | 41 min 59 s | N/A |
| Locusta migratoria | 5759.80 | 2 h 37 min 14 s | 2 h 25 min 12 s | 11 h 20 min 14 s | N/A | 18 min 39 s | N/A |
| Homo sapiens | 3241.95 | 1 h 24 min 15 s | 2 h 40 min 51 s | 2 h 36 min 00 s | 12 h 31 min 48 s | 10 min 43 s | 35 min 14 s |
| Rattus norvegicus | 2870.18 | 1 h 18 min 52 s | 2 h 04 min 44 s | N/A | N/A | 9 min 40 s | N/A |
| Mus musculus | 2807.72 | 1 h 16 min 29 s | 1 h 55 min 37 s | N/A | N/A | 9 min 41 s | N/A |
| Danio rerio | 1371.72 | 36 min 29 s | 2 h 33 min 36 s | 2 h 07 min 54 s | 13 h 59 min 40 s | 5 min 07 s | 17 min 16 s |
| NA12878 | 82705.67 | 1 d 12 h 15 min 18 s | 1 d 16 h 15 min 14 s | — | — | — | — |
Note: The genome dimension is measured as the size in Mb of the corresponding fasta file.
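To compare the running times numerically, the duration strings in the table can be converted to seconds and combined with the file size. The following is a minimal sketch under the assumption that every timing cell uses the units `d`, `h`, `min` and `s`, with missing entries given as `N/A` or a dash. The helper names `parse_duration` and `throughput_mb_per_s` are hypothetical.

```python
import re

# Seconds per unit for the units that appear in the running-time table.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "min": 60, "s": 1}


def parse_duration(text):
    """Convert e.g. '1 d 12 h 15 min 18 s' to seconds.

    Returns None when the cell holds no duration (for example 'N/A').
    """
    parts = re.findall(r"(\d+)\s*(d|h|min|s)\b", text)
    if not parts:
        return None
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def throughput_mb_per_s(size_mb, duration_text):
    """Megabytes of FASTA processed per second, or None if no timing exists."""
    seconds = parse_duration(duration_text)
    return size_mb / seconds if seconds else None


if __name__ == "__main__":
    # Homo sapiens row: Dot-to-dot filter versus TRF.
    dot2dot = parse_duration("1 h 24 min 15 s")
    trf = parse_duration("2 h 40 min 51 s")
    print(dot2dot, trf, round(trf / dot2dot, 2))
    print(throughput_mb_per_s(3241.95, "1 h 24 min 15 s"))
```

Because the size column is the size of the FASTA file rather than the number of bases, throughput figures derived this way are only approximate.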