Hani Z Girgis
Abstract
BACKGROUND: With rapid advancements in technology, the sequences of thousands of species' genomes are becoming available. Within the sequences are repeats that comprise significant portions of genomes. Successful annotations thus require accurate discovery of repeats. As species-specific elements, repeats in newly sequenced genomes are likely to be unknown. Therefore, annotating newly sequenced genomes requires tools to discover repeats de-novo. However, the currently available de-novo tools have limitations concerning the size of the input sequence, ease of use, sensitivities to major types of repeats, consistency of performance, speed, and false positive rate.
Year: 2015 PMID: 26206263 PMCID: PMC4513396 DOI: 10.1186/s12859-015-0654-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Method overview. (a) A sequence of scores: The score of each nucleotide is the adjusted count of the k-mer starting at this nucleotide. (b) Smoothed scores: The smoothed score is the weighted average of the flanking scores. The weights are assigned according to a Gaussian distribution. The local maxima, marked by ‘+’, are located using the second derivative test. (c) Candidate regions: The labeling module locates candidates (thin and colored in red) and potential non-repetitive regions (thick and colored in black). Regions found in the whole genome are used for training the hidden Markov model (HMM). (d) Final regions: The scanning module applies the trained HMM to locate the final repetitive regions (thin and colored in red). Notice that the final repetitive regions are less fragmented than the candidates. Additionally, they include all local maxima, even the ones that were missed by the labeling module
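The scoring step in panel a of Fig. 1 can be sketched as follows. This is a minimal illustration, not the tool's implementation: the caption does not say how the counts are adjusted, so the sketch uses raw k-mer counts, and the function name `kmer_scores` is hypothetical.

```python
from collections import Counter
from typing import List


def kmer_scores(seq: str, k: int) -> List[int]:
    """Score each position by the raw count of the k-mer starting there.

    Assumption: the caption's "adjusted" count is not specified in this
    text, so no adjustment is applied here.
    """
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return []
    # Count every k-mer once over the whole sequence.
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    # Look up the count of the k-mer that starts at each position.
    return [counts[seq[i:i + k]] for i in range(n_kmers)]


if __name__ == "__main__":
    # "ACG" occurs three times, so positions 0, 3, and 6 score 3.
    print(kmer_scores("ACGACGACGTTT", k=3))  # [3, 2, 2, 3, 2, 2, 3, 1, 1, 1]
```

Positions inside a repeated stretch receive high scores, which is what the later smoothing and labeling steps rely on.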
Fig. 2 Gaussian mask. This example mask represents the weights used by the labeling module for smoothing a sequence of scores. The width of this example mask is 40. Therefore, the smoothed score is the weighted average of the scores of the 40-bp-long region centered on the score of interest
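The smoothing with a Gaussian mask and the search for local maxima described in Figs. 1 and 2 might look like the sketch below. Several details are assumptions made for illustration only: the standard deviation (`width / 6`), the renormalization of weights at the sequence ends, and the discrete form of the second derivative test. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def gaussian_mask(width: int, sigma: float) -> List[float]:
    """Return `width` Gaussian weights that sum to 1."""
    center = (width - 1) / 2.0
    raw = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(width)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth(scores: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Weighted average of the scores under the mask, centered on each position."""
    half = len(mask) // 2
    out = []
    for i in range(len(scores)):
        acc = 0.0
        wsum = 0.0
        for j, w in enumerate(mask):
            pos = i + j - half
            if 0 <= pos < len(scores):  # ignore mask cells that fall off the ends
                acc += w * scores[pos]
                wsum += w
        out.append(acc / wsum)  # renormalize by the weights actually used
    return out


def local_maxima(values: Sequence[float]) -> List[int]:
    """Indices where the slope turns from positive to non-positive.

    Discrete second derivative test: the first difference changes sign
    and the second difference is negative.
    """
    maxima = []
    for i in range(1, len(values) - 1):
        left = values[i] - values[i - 1]
        right = values[i + 1] - values[i]
        second = right - left
        if left > 0 and right <= 0 and second < 0:
            maxima.append(i)
    return maxima


if __name__ == "__main__":
    mask = gaussian_mask(40, sigma=40 / 6.0)  # width 40 as in the example mask
    scores = [1.0] * 100 + [9.0] * 30 + [1.0] * 100
    print(local_maxima(smooth(scores, mask)))
```

With an even width such as 40, the window cannot be perfectly centered on one position; the sketch lets it extend one position further to the left than to the right.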
Fig. 3 Example of the HMM structure. In this simplified example, the HMM consists of four states: two states representing repeats (R_H and R_L) and two states representing non-repeats (N_H and N_L). The model has transitions from each state to the other three states. Additionally, there is a transition from each state to itself to allow the model to stay in the same state that generates multiple subsequent scores. The assumption underlying this structure is that repetitive regions consist mainly of high scores interleaved with a small number of low ones; in contrast, non-repetitive regions consist mainly of low scores interleaved with a small number of high ones. States R_H and R_L generate high and low scores, respectively, in repetitive regions. States N_H and N_L generate high and low scores, respectively, in non-repetitive regions
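To make the four-state structure of Fig. 3 concrete, the sketch below decodes a sequence of scores with the Viterbi algorithm. It is an illustration under stated assumptions: scores are reduced to two symbols (`"H"` for high, `"L"` for low), all probabilities are invented for the example rather than trained, and the state names are hypothetical labels for the two repeat and two non-repeat states.

```python
import math
from typing import Dict, List

STATES = ["repeat_high", "repeat_low", "nonrepeat_high", "nonrepeat_low"]

# Illustrative (untrained) parameters. Every state can reach every state,
# including itself, as in the structure described in the caption.
START: Dict[str, float] = {s: 0.25 for s in STATES}
TRANS: Dict[str, Dict[str, float]] = {
    "repeat_high": {"repeat_high": 0.80, "repeat_low": 0.15,
                    "nonrepeat_high": 0.01, "nonrepeat_low": 0.04},
    "repeat_low": {"repeat_high": 0.70, "repeat_low": 0.20,
                   "nonrepeat_high": 0.02, "nonrepeat_low": 0.08},
    "nonrepeat_high": {"repeat_high": 0.04, "repeat_low": 0.01,
                       "nonrepeat_high": 0.15, "nonrepeat_low": 0.80},
    "nonrepeat_low": {"repeat_high": 0.04, "repeat_low": 0.01,
                      "nonrepeat_high": 0.10, "nonrepeat_low": 0.85},
}
EMIT: Dict[str, Dict[str, float]] = {
    "repeat_high": {"H": 0.99, "L": 0.01},
    "repeat_low": {"H": 0.01, "L": 0.99},
    "nonrepeat_high": {"H": 0.99, "L": 0.01},
    "nonrepeat_low": {"H": 0.01, "L": 0.99},
}


def viterbi(obs: List[str]) -> List[str]:
    """Most likely state path for a sequence of 'H'/'L' symbols (log space)."""
    if not obs:
        return []
    best = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one back-pointer table per observation after the first
    for symbol in obs[1:]:
        nxt, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            nxt[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            ptr[s] = prev
        back.append(ptr)
        best = nxt
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def label_regions(path: List[str]) -> List[str]:
    """Collapse the four states into repeat ('R') or non-repeat ('N')."""
    return ["R" if s.startswith("repeat") else "N" for s in path]


if __name__ == "__main__":
    # A single low score inside a run of high scores stays in the repeat.
    print(label_regions(viterbi(list("HHHLHHH"))))
```

Because an isolated low score can be absorbed by the low-score repeat state, decoded regions come out less fragmented than a simple threshold on the scores would give.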
Comparisons of the performances of RepeatScout, ReCon, WindowMasker, and Red. Repeats detected by RepeatMasker are considered the ground truth in this study
| Tool | SN, transposable elements (%) | SN, tandem repeats (%) | SN, low complexity (%) | SN, other repeats (%) | SN, all repeats (%) | SP (%) | PP (%) | FPL (bp) | PR (bp) | Time (sec) | Memory (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | | |
| RS | 62.5 | 79.6 | 13.6 | 29.2 | 63.5 | 90.5 | 33.7 | 239474 | 9355324a | 948350 | 4701 |
| Red | 61.0 | 86.2 | 68.9 | 33.9 | 62.8 | 89.3 | 35.5 | 2657024 | 16414125a | 5184 | 6775 |
| WM | 55.2 | 74.9 | 81.9 | 25.7 | 56.7 | 87.2 | 36.1 | 423707488 | 3109241a | 14866 | 615 |
| RC | 55.0 | 75.2 | 11.0 | 11.5 | 56.2 | 95.4 | 29.2 | 137633 | 3575640a | 898844 | 14666 |
| | | | | | | | | | | | |
| Red | 90.0 | 59.4 | 43.3 | 83.0 | 84.1 | 94.0 | 23.2 | 312686 | 9401953 | 206 | 916 |
| RS | 86.3 | 24.3 | 1.9 | 71.8 | 74.4 | 98.0 | 18.4 | 0 | 4913141 | 79008 | 979 |
| RC | 86.7 | 18.2 | 1.8 | 80.0 | 74.0 | 99.0 | 17.6 | 0 | 4002422 | 13979 | 1513 |
| WM | 45.5 | 64.8 | 62.7 | 42.1 | 48.8 | 90.8 | 22.3 | 15150087 | 17118084 | 2869 | 325 |
| | | | | | | | | | | | |
| RS | 96.7 | 55.5 | 25.8 | 89.9 | 96.3 | – | 80.0 | 44447 | 66587503 | 347082 | 7344 |
| Red | 93.3 | 58.1 | 31.9 | 88.7 | 93.0 | – | 78.8 | 6257 | 94687287 | 2731 | 6741 |
| RC | 91.6 | 33.5 | 12.9 | 88.8 | 91.1 | – | 74.3 | 20864 | 32450624 | 192223 | 3419 |
| WM | 82.3 | 63.7 | 40.1 | 86.6 | 82.1 | – | 67.2 | 36189699 | 33998795 | 7589 | 639 |
| | | | | | | | | | | | |
| RC | 96.3 | 42.5 | 22.6 | 99.9 | 92.7 | 95.1 | 46.4 | 2144642 | 123719267 | 304490 | 8609 |
| RS | 92.5 | 39.6 | 19.1 | 94.4 | 89.0 | 92.0 | 43.6 | 2690420 | 110068092 | 134936 | 1516 |
| Red | 86.9 | 42.5 | 28.9 | 96.1 | 83.9 | 94.5 | 41.6 | 1794609 | 107704170 | 1653 | 1770 |
| WM | 68.1 | 83.4 | 83.6 | 3.5 | 68.9 | 95.4 | 44.4 | 170352943 | 186081334 | 13319 | 356 |
| | | | | | | | | | | | |
| Red | 94.7 | 93.8 | 96.5 | 1.0 | 94.3 | – | 54.9 | 2378281 | 10582912 | 61 | 235 |
| WM | 35.1 | 92.7 | 95.1 | 4.8 | 84.7 | – | 53.0 | 14238455 | 10769264 | 20 | 2 |
| RC | 95.0 | 25.0 | 7.6 | 0.0 | 31.9 | – | 13.6 | 0 | 1883829 | 18317 | 957 |
| RS | 79.4 | 27.0 | 4.6 | 0.0 | 30.4 | – | 13.5 | 0 | 1969788 | 17476 | 925 |
| | | | | | | | | | | | |
| WM | – | 91.2 | 90.6 | 28.5 | 89.2 | – | 61.4 | 10902380 | 10246592 | 23 | 7 |
| Red | – | 87.6 | 84.2 | 90.7 | 87.2 | – | 51.8 | 972553 | 8129833 | 63 | 416 |
| RS | – | 43.9 | 9.9 | 40.1 | 39.4 | – | 15.3 | 36882 | 1797419 | 36194 | 918 |
| RC | – | 20.3 | 7.7 | 34.5 | 19.1 | – | 9.3 | 4827 | 1314829 | 7011 | 1052 |
| | | | | | | | | | | | |
| Red | – | 88.7 | 81.0 | – | 88.5 | – | 44.0 | 160705 | 1914667 | 7 | 1 |
| WM | – | 63.6 | 33.3 | – | 63.0 | – | 17.5 | 672523 | 755248 | 2 | 2 |
| RS | – | 20.8 | 27.3 | – | 21.0 | – | 5.6 | 0 | 240852 | 331 | 640 |
| RC | – | 0.0 | 0.0 | – | 0.0 | – | 0.0 | 0 | 2089 | 69 | 852 |
The five SN columns report, from left to right, the sensitivity to: all types of transposable elements; tandem repeats, including microsatellites and satellites; low complexity regions; repeats that are not transposons, tandem repeats, or low complexity regions; and all types of repeats. SP is the specificity to coding regions. PP stands for the percentage of the nucleotides of a chromosome predicted to be repeats. The False Positive Length (FPL) is the total length of repeats found in a synthetic random genome with the same length as the original genome; the synthetic genome is generated by a group of 6th-order Markov chains, each trained on one real chromosome. Repeats found in the synthetic genome by RepeatMasker were removed. Potential Repeats (PR) is the number of nucleotides that were found in the repeats predicted by a tool but not in the repeats located by RepeatMasker. The symbol “bp” stands for base pair, and “MB” for megabyte. An ‘a’ next to a PR value indicates that these repeats are confirmed novel repeats
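The nucleotide-level measures defined in this footnote (SN, PP, and PR) can be illustrated with a small sketch. It assumes that repeats are given as half-open intervals on a single chromosome and expands them into position sets, which is only practical for small examples; the function names are hypothetical and this is not the evaluation code used in the study.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def covered(intervals: Iterable[Interval]) -> Set[int]:
    """Set of nucleotide positions covered by the intervals."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def sensitivity(predicted: Iterable[Interval], truth: Iterable[Interval]) -> float:
    """SN: percentage of ground-truth repeat nucleotides found by the tool."""
    truth_pos = covered(truth)
    if not truth_pos:
        return 0.0
    return 100.0 * len(truth_pos & covered(predicted)) / len(truth_pos)


def percentage_predicted(predicted: Iterable[Interval], chrom_len: int) -> float:
    """PP: percentage of the chromosome's nucleotides predicted to be repeats."""
    return 100.0 * len(covered(predicted)) / chrom_len


def potential_repeats(predicted: Iterable[Interval], truth: Iterable[Interval]) -> int:
    """PR: nucleotides predicted by the tool but absent from the ground truth."""
    return len(covered(predicted) - covered(truth))


if __name__ == "__main__":
    truth = [(10, 20), (50, 60)]                # e.g. RepeatMasker repeats
    predicted = [(12, 22), (50, 55), (80, 90)]  # repeats reported by a tool
    print(sensitivity(predicted, truth))         # 65.0
    print(percentage_predicted(predicted, 100))  # 25.0
    print(potential_repeats(predicted, truth))   # 12
```

For whole genomes, a sorted interval sweep would replace the position sets, but the definitions stay the same.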
Examples of the confirmed novel repeats found by Red in the genome of the Homo sapiens (hg38)
| Location | Copy number | Length | Sequence |
|---|---|---|---|
| chr1:242110361–242110392 | 56058 | 31 | ACATTCAAGTGATTCTCCTGCCTCAGCCTCA |
| chr2:119996761–119996800 | 51817 | 39 | TCAATTGGCCGGGTGCGGTGGCTCACACCTGTAATCCCA |
| chr9:61982576–61982619 | 2292 | 43 | TTGGGATTTCAGGCGTGAGCCACTGTGCCTGGCCAGCATTGCT |
| chrX:129964870–129964913 | 1344 | 43 | TGTGTGTGGGTCTGTGTGTGAGAGAGAGAAAGAGAGAAACATG |
| chr16:55969556–55969604 | 1327 | 48 | GTACATATATATACGTGTGTGTGTGTGTGTGTGTATATATATAAATTA |
| chr19:20397318–20397528 | 324 | 210 | GCTTTGTTACAGTATTGGTTTCTGTCCACTATGAATTCTCTTATGTTTAT |
| | | | TGAAGTCTGAGGACCAGTTAAAAGCTTTGCCACATTCTTCACATTTGCAA |
| | | | GGTTTCTCTCCAGTATGAATTGTCTTATATTCACTTAGAGTTGAGGATGC |
| | | | AGTAAAGGCTTTGCCACATTCTTCACATTTGTAAGGTTTCTCTCCAGTAT |
| | | | GAGTTCTCCT |
| chr10:38932814–38933047 | 73 | 233 | ACTAGGGTAGGTAATTTCATCTCAGTCTTATGCAGGTACCTTTTCTCAGG |
| | | | ATCTCAGGAATGCAGACTTCTCACACTTCTGTTCTTTTCCTGGCTGTGTT |
| | | | GGTGAGCTCAGTGATATTCCTCCATCACCTTCAAGAGCAGTTTTGTTTTG |
| | | | TTTTTCCTGTTTTCATACTCCCAGCATCAGGAGTGTTCTAGGTGTGTCAG |
| | | | TTTTTGTTACCTTCCCCTACATATTAAGTGGAA |
| chr18:79830653–79831758 | 15a | 1105 | TTCCCTGCGGACAGAGCCTTTGTCAGGAGGGTTCCCTGCAGACAGAGCCT |
| | | | TCGTCAGGAGGGTTCCCTGCAGACAGAGCCTTCGTCAGGAGGGTTCCCTG |
| | | | CGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAG |
| | | | GAGGGTTCCCTGCATACAGAGCCTTCGTCAGGAGCGTTCTCTGCGGACAG |
| | | | AGCCTTCGTCAGGAGGGTTCCCTGCATACAGAGCCTTCGTCAGGAGGGTT |
| | | | CCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTC |
| | | | GTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCG |
| | | | GACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGA |
| | | | GGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAG |
| | | | CCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCC |
| | | | CTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGT |
| | | | CAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGA |
| | | | CAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGG |
| | | | GTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCC |
| | | | TTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCT |
| | | | GCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCA |
| | | | GGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACA |
| | | | GAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGGAGCGT |
| | | | GCCCTGCGTACAGAGCCTTCGTCAGGAGCGTGCCCTGCGTACAGAGCCTT |
| | | | CGTCAGGAGCGTGCCCTGCGTACAGAGCCTTCGTCAGGAGCGTGCCCTGC |
| | | | GGACAGAGCCTTCGTCAGGAGGGTTCCCTGCGGACAGAGCCTTCGTCAGG |
| | | | AGGGTTCCCTGCGGACAGAGCCTTCATCAGGAGGGTTCCCTGCGGACAGA |
| | | | GCCTT |
The sequence chr18:79830653–79831758 has 15 overlapping copies, marked by ‘a’
Fig. 4 Distribution of the confirmed novel repeats (CNR) found by Red in four human chromosomes. The unit mbp stands for mega base pair. Each chromosome is divided into 1-mbp segments, which are plotted on the x-axis. The total size of the confirmed novel repeats detected by Red in each of the 1-mbp segments is displayed on the y-axis. Segments spanning the centromere of the chromosome are colored in red. (a) The distribution of the CNR in the human chromosome 7. (b) The distribution of the CNR in the human chromosome 10. (c) The distribution of the CNR in the human chromosome 17. (d) The distribution of the CNR in the human chromosome 20
Copies of the 233-bp-long centromeric novel repeat: chr10:38932814–38933047. There are 73 copies of this novel repeat found throughout the human genome. The locations of 61 of these copies are known, whereas 12 of them are mapped to random segments of the genome. Out of the 61, 41 (67.2 %) copies are located in or within 1 mbp of the centromeres of several human chromosomes
| BLAST Hit | Identity (%) | Centromeric? | BLAST Hit | Identity (%) | Centromeric? |
|---|---|---|---|---|---|
| chr1:125097011–125096780 | 94.9 | yes | chr9:65753438–65753671 | 94.9 | no |
| chr1:125143875–125144108 | 92.7 | yes | chr10:42109674–42109443 | 95.3 | yesa |
| chr1:143279916–143279685 | 92.7 | no | chr10:42120842–42120611 | 94.9 | yesa |
| chr1:143535471–143535704 | 93.6 | no | chr15:19934467–19934700 | 94.9 | yes |
| chr2:90335400–90335172 | 94.4 | yesa | chr16:32106608–32106841 | 94.4 | no |
| chr2:91468090–91468320 | 94.8 | yes | chr16:32821244–32821013 | 94.4 | no |
| chr2:91800041–91800275 | 94.4 | yes | chr16:33049905–33050138 | 94.4 | no |
| chr2:92052731–92052499 | 94.4 | yes | chr16:34041061–34041294 | 95.3 | yesa |
| chr2:92076458–92076691 | 95.3 | yes | chr16:34460177–34459946 | 91.9 | yesa |
| chr2:94282644–94282411 | 94.5 | yes | chr16:34497844–34497619 | 94.3 | yesa |
| chr2:132008526–132008295 | 94.9 | no | chr16:34510019–34509788 | 94.0 | yesa |
| chr2:132047806–132047581 | 92.7 | no | chr16:34653787–34654020 | 93.1 | yesa |
| chr7:53130967–53130730 | 91.2 | no | chr16:34692352–34692578 | 93.4 | yesa |
| chr7:57878379–57878148 | 94.4 | yesa | chr16:34792862–34793095 | 93.1 | yesa |
| chr7:58101480–58101713 | 94.4 | yes | chr16:34838562–34838795 | 94.4 | yesa |
| chr7:60906762–60906995 | 94.9 | yes | chr16:34850733–34850960 | 94.3 | yesa |
| chr7:60982800–60983033 | 94.9 | yes | chr16:34888413–34888646 | 91.9 | yesa |
| chr7:61076250–61076019 | 94.9 | yes | chr16:34943100–34943333 | 94.9 | yesa |
| chr7:61583132–61582901 | 94.9 | yes | chr16:36163632–36163865 | 93.1 | yes |
| chr7:62318164–62317933 | 94.4 | yes | chr16:36222879–36223112 | 91.9 | yes |
| chr7:62398128–62398361 | 94.4 | yes | chr16:46425718–46425487 | 94.4 | no |
| chr7:62434242–62434475 | 94.9 | yes | chr16:46436413–46436188 | 92.3 | no |
| chr7:65114062–65114290 | 92.3 | no | chr17:26756876–26757109 | 95.7 | yes |
| chr7:65519041–65518815 | 92.3 | no | chr17:26953504–26953737 | 97.4 | yes |
| chr7:65581931–65581705 | 92.3 | no | chr18:15163745–15163514 | 95.7 | yes |
| chr9:40660714–40660483 | 94.9 | no | chr18:15207260–15207029 | 94.4 | yes |
| chr9:43290928–43290698 | 95.3 | yes | chr21:8585703–8585935 | 94.0 | no |
| chr9:43313883–43314115 | 95.3 | yes | chr21:10618400–10618633 | 94.4 | yes |
| chr9:63460122–63459891 | 94.9 | no | chr22:10562609–10562841 | 94.0 | no |
| chr9:64784259–64784028 | 94.9 | no | chr22:16255856–16256089 | 95.7 | yes |
| chr9:65268071–65267840 | 94.9 | no |
Copies within 1 mbp from the centromeres are marked by ‘a’
The specificity to nucleotides comprising duplicated human genes. Duplicated genes were divided into the following three groups according to their copy numbers: the 2–4 group, the 5–9 group, and the 10-or-more group. SP is the percentage of the nucleotides of the genes in a group that are excluded by a tool
| Gene Copy Number | 2–4 | 5–9 | ≥10 |
|---|---|---|---|
| Length (bp) | 2,582,680 | 447,130 | 708,127 |
| | | | |
| ReCon | 95.4 | 95.1 | 90.4 |
| RepeatScout | 89.9 | 75.2a | 53.1a |
| Red | 89.4 | 87.8 | 68.5 |
| WindowMasker | 87.6a | 88.4 | 80.6 |
| RepeatMasker | 89.4 | 88.6 | 84.3 |
The lowest SP on a gene group is marked by ‘a’
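The per-group specificity defined for this table (the percentage of gene nucleotides in a copy-number group that a tool excludes from its predicted repeats) can be sketched as below. The input layout, the half-open interval convention, the single shared coordinate space, and the function names are all assumptions made for illustration; overlapping genes are counted once per gene.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def copy_number_group(copies: int) -> str:
    """Map a gene copy number to the three groups used in the table."""
    if copies >= 10:
        return ">=10"
    if copies >= 5:
        return "5-9"
    if copies >= 2:
        return "2-4"
    raise ValueError("a duplicated gene has at least two copies")


def specificity_by_group(
    genes: List[Tuple[int, Interval]], masked: Iterable[Interval]
) -> Dict[str, float]:
    """SP per group: percentage of gene nucleotides NOT predicted as repeats.

    `genes` holds (copy_number, interval) pairs; `masked` holds the
    intervals a tool predicted as repeats.
    """
    masked_pos = set()
    for start, end in masked:
        masked_pos.update(range(start, end))
    totals: Dict[str, int] = {}
    excluded: Dict[str, int] = {}
    for copies, (start, end) in genes:
        group = copy_number_group(copies)
        gene_pos = set(range(start, end))
        totals[group] = totals.get(group, 0) + len(gene_pos)
        excluded[group] = excluded.get(group, 0) + len(gene_pos - masked_pos)
    return {g: 100.0 * excluded[g] / totals[g] for g in totals}


if __name__ == "__main__":
    genes = [(2, (0, 100)), (3, (200, 300)), (7, (400, 500)), (12, (600, 700))]
    masked = [(50, 100), (650, 700), (690, 720)]
    print(specificity_by_group(genes, masked))
    # {'2-4': 75.0, '5-9': 100.0, '>=10': 50.0}
```

A low value in a group means the tool labels a large share of those genes as repeats, which is the behavior the table tracks as copy number grows.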