Sébastien Leclercq, Eric Rivals, Philippe Jarne.
Abstract
BACKGROUND: Microsatellites are short, tandemly repeated DNA sequences that are widely distributed among genomes. Their structure, role and evolution can be analyzed based on exhaustive extraction from sequenced genomes. Several dedicated algorithms have been developed for this purpose. Here, we compared the detection efficiency of five of them (TRF, Mreps, Sputnik, STAR, and RepeatMasker).
Year: 2007 PMID: 17442102 PMCID: PMC1876248 DOI: 10.1186/1471-2105-8-125
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of detections per megabase, average length (bp), and average divergence (%) of detections for combinations of parameters in the human X chromosome.
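| Algorithm (parameter setting) | Detections per Mb | Average length (bp) | Average divergence (%) |
|---|---|---|---|
| TRF (minimum threshold score varied) | 110 | 64.44 | 3.96 |
| TRF (minimum threshold score varied) | 202 | 47.65 | 3.68 |
| TRF (minimum threshold score varied) | 458 | 32.14 | 3.21 |
| TRF (minimum threshold score varied) | 2425 | 16.07 | 1.60 |
| TRF (alignment weights varied) | 110 | 64.44 | 3.96 |
| TRF (alignment weights varied) | 125 | 73.62 | 6.01 |
| TRF (alignment weights varied) | 136 | 76.44 | 7.13 |
| TRF (alignment weights varied) | 177 | 83.30 | 11.31 |
| Mreps (resolution 1) | 1368 | 22.96 | 12.39 |
| Mreps (resolution 2) | 1539 | 28.11 | 18.47 |
| Mreps (resolution 3) | 1636 | 32.21 | 22.15 |
| Mreps (resolution 6) | 1712 | 39.80 | 26.51 |
| Sputnik (minimum threshold score varied) | 154 | 34.55 | 1.13 |
| Sputnik (minimum threshold score varied) | 349 | 25.39 | 1.06 |
| Sputnik (minimum threshold score varied) | 4273 | 11.23 | 0.48 |
| Sputnik (minimum threshold score varied) | 6589 | 9.74 | 0.44 |
| Sputnik (mismatch penalty varied) | 6555 | 9.33 | 0.01 |
| Sputnik (mismatch penalty varied) | 6589 | 9.74 | 0.44 |
| Sputnik (mismatch penalty varied) | 6818 | 10.12 | 1.19 |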
TRF alignment weights were set to {2,7,7} when varying the minimum threshold score, and the minimum threshold score was set to 50 when varying the alignment weights. Mreps resolution was 1, 2, 3, and 6. Sputnik mismatch penalty was set to -6 when varying the minimum threshold score, and the minimum threshold score was set to 7 when varying the mismatch penalty. Match bonus and fail score were always fixed to 1 and -1, respectively. Divergence is deduced from the alignment of the detected sequence with the corresponding perfectly repeated sequence of the focal consensus motif.
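As a rough illustration of how a divergence value can be obtained from a detection and its consensus motif, the sketch below compares the detected sequence position by position with a perfect repeat of the motif and reports the percentage of differing positions. It is a simplification made for illustration: it assumes substitutions only, whereas the divergence described above comes from an alignment and can therefore also account for insertions and deletions. The function name `divergence_percent` is hypothetical.

```python
def divergence_percent(sequence: str, motif: str) -> float:
    """Percent of positions that differ from the best-phased perfect repeat.

    Simplifying assumption: substitutions only (no insertions or deletions),
    so positions are compared one to one instead of through an alignment.
    """
    if not sequence or not motif:
        raise ValueError("sequence and motif must be non-empty")
    best = len(sequence)
    for phase in range(len(motif)):
        # Rotate the motif so that the perfect repeat may start at any phase.
        rotated = motif[phase:] + motif[:phase]
        mismatches = sum(
            base != rotated[i % len(rotated)]
            for i, base in enumerate(sequence)
        )
        best = min(best, mismatches)
    return 100.0 * best / len(sequence)
```

For an 18 bp sequence such as TCCTCTTCTCTCCTCTCC with motif TCCTC, this counts one mismatch in 18 positions. For detections that contain insertions or deletions, the position-by-position comparison will overestimate the divergence, and an alignment-based distance would be needed instead.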
Figure 1. Length distributions for different minimum threshold scores of TRF. a- Number of detections (log scale) with TRF in the human X chromosome as a function of length (in bp) for minimum threshold score between 20 and 50. The alignment weights were {2,7,7}, and the few detections larger than 200 bp were discarded. The solid vertical line represents the minimum length not affected by the threshold score constraint. b- Number of detections (log scale) with Sputnik in the human X chromosome as a function of length (in bp) for validation score set to 7, 8, 15, and 20. The mismatch penalty was -6, and the few detections larger than 200 bp were discarded.
Detection sample obtained with TRF with different alignment weights, Sputnik with different mismatch penalties, and Mreps with different resolutions, in the human X chromosome.
| Start | End | Divergence (%) | Motif | Sequence |
|---|---|---|---|---|
| 304646 | 304658 | 0 | CTCTC | CTCTCCTCTCCTC | ||
| 304696 | 304713 | 5.55 | TCCTC | TCCTCTTCTCTCCTCTCC | ||
| 305863 | 305872 | 0 | CCTTC | CCTTCCCTTC | ||
| 304646 | 304713 | 18.3099 | TCTCC | CTCTCCTCTCCTC | ||
| 305863 | 305872 | 0 | TTCCC | CCTTCCCTTC | ||
| 304646 | 304713 | 18.0556 | TCTCC | CTCTCCTCTCCTCCTTCTCCGCTCCCTGCACTGCCCTCCGCTCCCTCCGGTCCTCTTCTCTCCTCTCC | ||
| 305836 | 305872 | 17.9487 | TTCCC | |||
| 304643 | 304713 | 18.9189 | CTCCT | |||
| 305765 | 305800 | 25.641 | CCA | CCACACCACCTCTGACGCCCACCACAGCCCCCCACC | ||
| 305836 | 305872 | 17.9487 | CCCTT | CCCTCTCCACTTCCTTCTCTTCCACCTCCTTCCCTTC | ||
| 552928 | 552935 | 0 | AG | GAGAGAGA | ||
| 552939 | 552948 | 0 | AG | GAGAGAGAGA | ||
| 552954 | 552963 | 0 | AAGAG | AAGAGAAGAG | ||
| 552964 | 552975 | 0 | AG | AGAGAGAGAGAG | ||
| 552928 | 552935 | 0 | AG | GAGAGAGA | ||
| 552939 | 552948 | 0 | AG | GAGAGAGAGA | ||
| 552954 | 552975 | 9.09 | AAGAG | AAGAGAAGAGAGAGAGAGAGAG | ||
| 552928 | 552948 | 9.52 | AG | GAGAGAGA | ||
| 552954 | 552975 | 9.09 | AAGAG | AAGAGAAGAGAGAGAGAGAGAG | ||
| 119591 | 119610 | 20 | AAT | ACAAAAAATAATAATTATAA | ||
| 119611 | 119628 | 5.56 | AAAAAT | ATAAATAAAAATAAAAAT | ||
| 119591 | 119615 | 24 | AAT | ACAAAAAATAATAATTATAA | ||
| 119611 | 119628 | 5.56 | AAAAAT | ATAAATAAAAATAAAAAT | ||
| 119591 | 119638 | 33.33 | A | ACAAAAAATAATAATTATAAATAAATAAAAATAAAAAT | ||
| 119590 | 119638 | 34.69 | A | |||
The threshold alignment score of TRF was set to 20 and alignment weights varied from {2,7,7} to {2,3,5}. Sputnik mismatch penalty was set to -10, -6, and -5. Mreps resolution value varied from 1 to 6. For each detection, we report the start and end positions, the divergence from a pure repeat, the motif, and the actual sequence. Variation of detection when reducing weights is as follows: n: newly detected sequence; e: enlargement of a previous sequence; c: concatenation of previous sequences. New nucleotides detected by enlarging or concatenating previous sequences are underlined.
The sequence at position 305765 is an example of a microsatellite detected only at low values of the TRF alignment weights. It is not detected until the alignment weights are lowered to {2,3,5}, because at higher weights the match bonuses cannot compensate for the imperfection penalties. Reducing alignment weights may also enlarge detections, as shown for alignment weights {2,5,5} at position 305836. A succession of close errors (in boldface) decreases the alignment score, which falls under the threshold score for weight values larger than {2,5,5}. Reducing alignment weights also provokes concatenation, when an enlarged tandem repeat overlaps with one of its neighbors. At position 304696, two substitutions (in boldface) stop detection when alignment weights are set to {2,7,7}. With a smaller substitution penalty (5 or less), the detection is enlarged up to position 304646 and overlaps with the other detection.
Reducing the Sputnik mismatch penalty allows detection of larger microsatellites by concatenating shorter, perfect ones. The two detections at positions 552928 and 552939 are concatenated with a mismatch penalty of -5, because the penalty induced by the two errors at positions 552936 and 552938 is compensated by the second detection. A second concatenation occurs at position 552964 with a mismatch penalty of -6. The two merged detections do not have the same motif, but the two errors induced by this difference are compensated by the matching bases at low values of the mismatch penalty.
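The effect of the TRF alignment weights on detection can be illustrated with a simplified score calculation. The sketch below assumes that the three weights are, in order, the match bonus, the substitution penalty, and the indel penalty, and that a candidate is kept when its total score reaches the minimum threshold score. The names `Weights`, `alignment_score`, and `is_reported` are hypothetical, and the function only totals counts that are supplied by the caller; TRF itself derives its score from an alignment, which this sketch does not perform.

```python
from typing import NamedTuple


class Weights(NamedTuple):
    match: int     # bonus added for each matching base
    mismatch: int  # penalty subtracted for each substitution
    indel: int     # penalty subtracted for each inserted or deleted base


def alignment_score(matches: int, mismatches: int, indels: int, w: Weights) -> int:
    """Total score of an aligned segment under the given weights."""
    return w.match * matches - w.mismatch * mismatches - w.indel * indels


def is_reported(matches: int, mismatches: int, indels: int,
                w: Weights, threshold: int) -> bool:
    """True when the segment score reaches the minimum threshold score."""
    return alignment_score(matches, mismatches, indels, w) >= threshold
```

Under these assumptions, a perfect 10 bp repeat scores exactly 20 with weights {2,7,7}, which is why a threshold of 50 excludes perfect repeats shorter than 25 bp. With illustrative counts of 29 matches, 7 substitutions, and 3 indels, the score is 22 with {2,3,5} but only 8 with {2,5,5}, so such a segment would pass a threshold of 20 only at the lowest weights.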
A larger resolution value for Mreps enlarges already-detected tandem repeats. In the first part of the tandem repeat at position 119591, adjacent repeats are separated by at most one error, and this part is detected at resolution 1; however, repeats TAT and AAA are separated by two errors, so the second part can only be found at resolution 2 or higher. Finally, increasing resolution provokes concatenation. Detections for resolution 2 at positions 119591 and 119611 are enlarged when resolution is 3; both periods are reduced to 1 (see explanations in Methods), and the two sequences are merged.
Figure 2. Length distributions of perfect detections for each algorithm. Number of perfect detections (log scale) in the human X chromosome as a function of length (in bp) for the six motif classes and for each algorithm. Sputnik groups all detections with a decimal number of repeats into the class of the previous integer number of repeats. The numbers of detections were averaged by motif size to display values for lengths representing a decimal number of repeats.
Number of detections per Mbp, average length, and average divergence for TRF, Mreps, Sputnik, STAR, and RepeatMasker, in the genomes of four species.
| 2425 | 512 | 3119 | 2902 | 1822 | ||
| 1368 | 1084 | 1653 | 1371 | 879 | ||
| 6589 | 361 | 7475 | 7665 | 5712 | ||
| 395 | 260 | 311 | 343 | 182 | ||
| 256 | 179 | 207 | 230 | 104 | ||
| 16.07 | 28.84 | 14.24 | 14.61 | 13.85 | ||
| 22.96 | 24.99 | 20.04 | 20.93 | 20.28 | ||
| 9.74 | 19.83 | 9.39 | 9.35 | 8.98 | ||
| 39.89 | 49.80 | 31.07 | 32.86 | 33.12 | ||
| 53.97 | 64.93 | 48.52 | 45.80 | 54.88 | ||
| 1.60 | 7.59 | 1.61 | 1.47 | 1.35 | ||
| 12.39 | 15.65 | 11.46 | 10.10 | 11.71 | ||
| 0.44 | 7.96 | 0.46 | 0.38 | 0.32 | ||
| 7.45 | 11.33 | 7.98 | 6.44 | 7.59 | ||
| 8.40 | 11.97 | 13.42 | 9.31 | 13.14 | ||
Both imperfect and all (perfect plus imperfect) detections are provided for the human genome, whereas only all detections are reported for the other genomes. HS = Homo sapiens, SC = Saccharomyces cerevisiae, DM = Drosophila melanogaster, NC = Neurospora crassa. Divergence is deduced from the alignment of the detected sequence with the corresponding perfectly repeated sequence of the focal consensus motif.
Loci and nucleotide coverage between algorithms
| - | 34.94 | 20.4 | 9.51 | 7.37 | ||
| 85.61 | - | 45.3 | 17.26 | 12.6 | ||
| 82.63 | 80.85 | - | 33.34 | 24.63 | ||
| 95.29 | 93.92 | 93.61 | - | 57.98 | ||
| 100 | 97.89 | 97.64 | 82.13 | - | ||
Proportion of the total number of detections (perfect and imperfect) of algorithm A also detected (i.e., covered) by algorithm B in the human X chromosome. The value in brackets is the proportion of nucleotides detected by A and covered by B.
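The two coverage measures described above can be sketched as interval arithmetic on the start and end positions of the detections. The example below is a minimal sketch under stated assumptions: positions are inclusive, a detection of A counts as covered when it shares at least one nucleotide with any detection of B, and the nucleotide measure is the share of A's detected bases that fall inside B's detections. The names `merge_intervals` and `coverage` are hypothetical, and the exact overlap criterion used for the table is not stated in this excerpt.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) positions


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so that no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage(a: List[Interval], b: List[Interval]) -> Tuple[float, float]:
    """Return (percent of A detections covered by B, percent of A bases covered by B)."""
    if not a:
        raise ValueError("algorithm A has no detections")
    b_merged = merge_intervals(b)
    loci_hit = 0
    covered_bases = 0
    total_bases = 0
    for start, end in a:
        total_bases += end - start + 1
        # Sum the overlap of this detection with every merged B interval.
        overlap = sum(
            max(0, min(end, b_end) - max(start, b_start) + 1)
            for b_start, b_end in b_merged
        )
        covered_bases += overlap
        if overlap > 0:
            loci_hit += 1
    return 100.0 * loci_hit / len(a), 100.0 * covered_bases / total_bases
```

With the illustrative input `a = [(531688, 531711), (531849, 531856)]` and `b = [(531688, 531713)]`, one of the two detections of A is covered, and 24 of its 32 bases fall inside B. The two measures can differ widely when one algorithm returns many short detections and the other a few long ones.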
RepeatMasker, STAR, Mreps, TRF and Sputnik detections between starting positions 532800 and 53500 in the human X chromosome.
| Start | End | Divergence (%) | Motif | Sequence |
|---|---|---|---|---|
| 531688 | 531713 | 0 | AAT | AATAATAATAATAATAATAATAATAA | |
| 532355 | 532540 | 15.05 | TTCC | TTCCTTCCTCCCTTCCTTCCTTCCTTTCTTCTTTCTTTCTTTCCTTCCTTCCTGCTTTCCTTCCTTCCTTTCTTTTCTTTCTTTCCTTCCTTCCTTGCTTCCTTCCTTCCATCTTTCTCTTTCTCTTTTTCTTTCTTTCTCTCCTTCCTTCTTTCCTTCCTTCCTTCCCTTCCCTTCCTTCCTTCC | |
| 532704 | 532891 | 15.87 | TTCC | CCTTCCTTCCTTTCTTCTTTCTTTCCTTCCTTCCTTGCTTCCTTCCTTCCATCTTTCTTTCTTTCTTTCTTCCTCTCCTTCCTTCTTTCCTTCCTTCCTTCCCTTCCCTCCTTCCTTTTTCTTCTTCTCTTTCTTTCTTTCTCTTTCCTTCCTTCCTTCCTTCTTTCTCCTTCCTTCCTTCTTTCCTT | |
| 531688 | 531713 | 0 | AAT | AATAATAATAATAATAATAATAATAA | |
| 532537 | 532731 | 25.38 | TTTTTC | TTCCTTTTTCTTCTTCTCTTTCTTTCTTTCTTTTTCTTTCCTTCCTTCCTTCTTTCTCCTTCCTTCCTTCCATTTTTCTTTCTTTCTTTCTTTCTTTCTCTCTCTCTCTTTCTTTCTTTCTCTCTCTCTCTTCTTCCTTCCTTCCTTCCATTCTTCTTTCTTTCTTTCCTTCCTTCCTTTCTTCTTTCTTTCCTT | |
| 531688 | 531715 | 3.45 | AAT | AATAATAATAATAATAATAATAATAAAA | |
| 532330 | 532429 | 15.84 | TTCC | TTTCCTTCTTTCTTTCTTACTTTCTTTCCTTCCTCCCTTCCTTCCTTCCTTTCTTCTTTCTTTCTTTCCTTCCTTCCTGCTTTCCTTCCTTCCTTTCTTT | |
| 532428 | 532467 | 12.5 | TTCC | TTTCTTTCTTTCCTTCCTTCCTTGCTTCCTTCCTTCCATC | |
| 532466 | 532490 | 4 | TTTCTC | TCTTTCTCTTTCTCTTTTTCTTTCT | |
| 532491 | 532524 | 11.76 | TTCC | TTCTCTCCTTCCTTCTTTCCTTCCTTCCTTCCCT | |
| 532525 | 532542 | 5.56 | TTCC | TCCCTTCCTTCCTTCCTT | |
| 532551 | 532593 | 13.95 | TTTC | TCTCTTTCTTTCTTTCTTTTTCTTTCCTTCCTTCCTTCTTTCT | |
| 532593 | 532609 | 5.88 | TTCC | TCCTTCCTTCCTTCCAT | |
| 532609 | 532667 | 16.95 | TC | TTTTTCTTTCTTTCTTTCTTTCTTTCTCTCTCTCTCTTTCTTTCTTTCTCTCTCTCTCT | |
| 532667 | 532689 | 8.7 | TTCC | TTCTTCCTTCCTTCCTTCCATTC | |
| 532690 | 532756 | 11.94 | TTCC | TTCTTTCTTTCTTTCCTTCCTTCCTTTCTTCTTTCTTTCCTTCCTTCCTTGCTTCCTTCCTTCCATC | |
| 532755 | 532777 | 4.35 | TTTC | TCTTTCTTTCTTTCTTTCTTCCT | |
| 532776 | 532820 | 8.89 | TTCC | CTCTCCTTCCTTCTTTCCTTCCTTCCTTCCCTTCCCTCCTTCCTT | |
| 531688 | 531713 | 0 | AAT | AATAATAATAATAATAATAATAATAA | |
| 532313 | 532330 | 5.26 | TTTTC | TTTTCTTTTCTTTCTTTT | |
| 532423 | 532438 | 5.88 | TTTTC | TTTCTTTTCTTTCTTT | |
| 532466 | 532490 | 4 | TTTCTC | TCTTTCTCTTTCTCTTTTTCTTTCT | |
| 532544 | 532553 | 0 | TTC | TTCTTCTTCT | |
| 532550 | 532576 | 13.79 | TTTCTC | TTCTCTTTCTTTCTTTCTTTTTCTTTC | |
| 532633 | 532667 | 8.57 | TC | TCTCTCTCTCTCTTTCTTTCTTTCTCTCTCTCTCT | |
| 531568 | 531576 | 0 | ACC | ACCACCACC | |
| 531688 | 531711 | 0 | AAT | AATAATAATAATAATAATAATAAT | |
| 531849 | 531856 | 0 | TTGC | CTTGCTTG | |
| 531893 | 531900 | 0 | TG | TGTGTGTG | |
| 531927 | 531934 | 0 | ATGC | TGCATGCA | |
| 532078 | 532085 | 0 | AGGC | GCAGGCAG | |
| 532266 | 532273 | 0 | ATGC | TGCATGCA | |
| 532313 | 532322 | 0 | TTTTC | TTTTCTTTTC | |
| 532335 | 532354 | 5 | TTTC | TTCTTTCTTTCTTACTTTCT | |
| 532355 | 532422 | 10.29 | TTCC | TTCCTTCCTCCCTTCCTTCCTTCCTTTCTTCTTTCTTTCTTTCCTTCCTTCCTGCTTTCCTTCCTTCC | |
| 532423 | 532439 | 5.88 | TTTC | TTTCTTTTCTTTCTTTC | |
| 532440 | 532463 | 4.17 | TTCC | CTTCCTTCCTTGCTTCCTTCCTTC | |
| 532466 | 532489 | 4.17 | TTTCTC | TCTTTCTCTTTCTCTTTTTCTTTC | |
| 532500 | 532541 | 7.14 | TTCC | TCCTTCTTTCCTTCCTTCCTTCCCTTCCCTTCCTTCCTTCCT | |
| 532544 | 532552 | 0 | TTC | TTCTTCTTC | |
| 532553 | 532568 | 0 | TTTC | TCTTTCTTTCTTTCTT | |
| 532569 | 532576 | 0 | TTTC | TTTCTTTC | |
| 532577 | 532588 | 0 | TTCC | CTTCCTTCCTTC | |
| 532596 | 532607 | 0 | TTCC | TTCCTTCCTTCC | |
| 532615 | 532656 | 7.14 | TTTC | TTTCTTTCTTTCTTTCTTTCTCTCTCTCTCTTTCTTTCTTTC | |
| 532657 | 532666 | 0 | TC | TCTCTCTCTC | |
| 532669 | 532684 | 0 | TTCC | CTTCCTTCCTTCCTTC | |
| 532687 | 532692 | 0 | TTC | TTCTTC | |
| 532693 | 532704 | 0 | TTTC | TTTCTTTCTTTC | |
| 532705 | 532752 | 8.33 | TTCC | CTTCCTTCCTTTCTTCTTTCTTTCCTTCCTTCCTTGCTTCCTTCCTTC | |
| 532755 | 532774 | 0 | TTTC | TCTTTCTTTCTTTCTTTCTT | |
| 532780 | 532820 | 7.32 | TTCC | CCTTCCTTCTTTCCTTCCTTCCTTCCCTTCCCTCCTTCCTT |
Resolution of Mreps was set to 1, the threshold alignment score of TRF to 20, and the alignment weights of TRF to {2,7,7}. Sputnik mismatch penalty and validation score were set to -6 and 7, respectively. The number of detections varies among algorithms (from 3 to 18). Moreover, the sequence information is handled in different ways; an example is the region of cryptic simplicity between positions 532815 and 533080. RepeatMasker and STAR decompose it into large, distant and highly imperfect detections, though these differ between the two algorithms. Mreps returns a succession of shorter detections that together cover the whole region. TRF detects only short, weakly divergent subregions, which do not completely cover the region. Sputnik detections are very numerous, short and slightly divergent, but cover the whole region. Detection of compound microsatellites by Mreps is illustrated at position 533706, where the other algorithms detect only a perfect polyA stretch. The detection at position 534186 is returned as two detections by Mreps, because the two consecutive errors (insertions of G and C) stop the detection when resolution is set to 1. Very short hexanucleotides (12 bp) are detected by both TRF and Sputnik at positions 533138 and 534112. Most Sputnik detections are two-repeat tetranucleotides or three-repeat trinucleotides, which cannot be detected by the other algorithms.