Abstract
BACKGROUND: Dispersed repeats are a major component of eukaryotic genomes and drivers of genome evolution. Annotation of DNA sequences homologous to known repetitive elements has been mainly performed with the program REPEATMASKER. Sequences annotated by REPEATMASKER often correspond to fragments of repetitive elements resulting from the insertion of younger elements or other rearrangements. Although REPEATMASKER annotation is indispensable for studying genome biology, this annotation does not contain much information on the common origin of fossil fragments that share an insertion event, especially where clusters of nested insertions of repetitive elements have occurred.
Year: 2008 PMID: 19094224 PMCID: PMC2672092 DOI: 10.1186/1471-2164-9-614
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Re-annotation algorithm. A. Graphic representation of the input REPEATMASKER annotation of the first 100 Kb of genomic sequence [GENBANK:AF123535.1] around the adh gene of the maize cultivar LH_82 (this refers to the same maize sequence that was manually annotated in [45] and used to validate REANNOTATE's predictions in Results, Table 1, and Figure 3). B. Boxes highlighted in magenta on the bottom tier represent hits to the reference element PREM2_ZM_I (the internal region of an LTR-retrotransposon in REPBASE), of which the three innermost hits, shown again in red on the top tier united by a horizontal line, were defragmented by REANNOTATE into a repetitive element model. The black arrows show the orientation of the hits on the chromosome, and the three hits shown in red are colinear with the reference PREM2_ZM_I sequence. C. Boxes highlighted in blue on the bottom tier represent hits to the reference LTR sequence PREM2_ZM_LTR (in REPBASE). Above, two (single-hit) LTR models (shown in orange) flank an IR model (in red): these three models have been assembled into a higher-order model of an element of the PREM2_ZM family. D. The chromosomal span of the defragmented PREM2_ZM element (red and orange) is within the span of another element (bottom model in black); the PREM2_ZM element is inferred to have inserted into the element shown below it. Two other elements (black boxes on top tier) are inferred to have inserted into the PREM2_ZM element. E. Pairs of intra-element LTR sequences are output, aligned with CLUSTALW, and the number of point substitutions between them estimated.
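The defragmentation step in panel B can be illustrated with a deliberately simplified sketch. It is not REANNOTATE's actual implementation: the `Hit` record, the `colinear` test, and the greedy grouping are all assumptions made for illustration. The only value taken from the text is the default maximum model span of 40 Kb mentioned in the Figure 4 caption. Hits to the same reference, in the same orientation, whose reference coordinates progress consistently along the chromosome are chained into one model.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One similarity hit (hypothetical record; field names are illustrative)."""
    chrom_start: int
    chrom_end: int
    ref_name: str
    ref_start: int
    ref_end: int
    strand: str  # '+' or '-'


def colinear(a: Hit, b: Hit) -> bool:
    """Simplified test: b lies downstream of a on the chromosome and continues
    the same reference in the same orientation without overlapping a."""
    if a.ref_name != b.ref_name or a.strand != b.strand:
        return False
    if a.strand == '+':
        return b.ref_start >= a.ref_end
    return b.ref_end <= a.ref_start


def defragment(hits, max_span=40_000):
    """Greedily chain colinear hits into models (lists of hits)."""
    models = []
    for hit in sorted(hits, key=lambda h: h.chrom_start):
        # Try the most recently started models first.
        for model in reversed(models):
            span = hit.chrom_end - model[0].chrom_start
            if span <= max_span and colinear(model[-1], hit):
                model.append(hit)
                break
        else:
            models.append([hit])  # no compatible model: start a new one
    return models


example_hits = [
    Hit(100, 200, "IR", 1, 100, '+'),
    Hit(300, 400, "LTR", 1, 50, '+'),    # nested hit to another reference
    Hit(500, 600, "IR", 150, 250, '+'),
    Hit(700, 800, "IR", 300, 400, '+'),
    Hit(900, 1000, "IR", 10, 90, '+'),   # not colinear with the IR model
]
example_models = defragment(example_hits)
```

In this toy input the three colinear `IR` hits end up in one model even though an `LTR` hit interrupts them, which mirrors the situation drawn in panel B; the final `IR` hit starts a separate model because its reference coordinates go backwards.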
Comparison between automated and human annotation of TEs
| repeat | hits | nests | time ± s.d. (Mya) | type | ||||||
| a | 2 | 2 | ... | - | ... | - | LTR | |||
| b | 1 | 1 | ... | - | ... | - | LTR | |||
| c | 1 | 0 | - | - | - | - | LTR | |||
| d | 2 | 0 | - | - | - | - | LTR | |||
| e | 00081 | 3 | 0 | >2.4 ± 1.4 | > .18 ± .15 | LTR | ||||
| * | 00081 | 1 | 0 | - | - | - | - | LTR | ||
| f | 3 | 1 | 2.4 ± 1.4 | 0.18 ± 0.15 | LTR | |||||
| g | 00098 | 5 | 0 | 18.2 ± 4.1 | 1.40 ± 0.44 | LTR | ||||
| h | 3 | 1 | 15.3 ± 3.1 | 1.18 ± 0.34 | LTR | |||||
| i | 00093 | 6 | 0 | 30.7 ± 18 | 2.36 ± 1.92 | LTR | ||||
| j | 1 | 1 | - | < 31 ± 18 | - | < 2.4 ± 1.9 | LTR | |||
| k | 5 | 1 | 24.7 ± 4.7 | 1.90 ± 0.51 | LTR | |||||
| l | 3 | 2 | 6.4 ± 2.3 | 0.49 ± 0.25 | LTR | |||||
| m | 1 | 2 | - | < 25 ± 5 | - | < 1.9 ± 0.5 | LTR | |||
| n | 3 | 1 | 20.8 ± 4.1 | 1.60 ± 0.44 | LTR | |||||
| o | 4 | 0 | 26.4 ± 9.4 | 2.03 ± 1.02 | LTR | |||||
| p | 4 | 1 | 3.4 ± 2.4 | 0.26 ± 0.26 | LTR | |||||
| q | 00243 | 2 | 1 | ... | - | ... | - | LTR | ||
| 1 | 2 | 0 | - | - | - | - | LTR | |||
| 2 | 1 | 0 | - | - | - | - | LTR | |||
| 3 | 4 | 0 | 26.6 ± 4.2 | 2.04 ± 0.46 | LTR | |||||
| 1 | - | 1 | - | < 27 ± 4 | - | < 2.0 ± 0.5 | LTR | |||
| 1 | - | 1 | - | < 27 ± 4 | - | < 2.0 ± 0.5 | LTR | |||
| 4 | 1 | 1 | - | < 27 ± 4 | - | < 2.0 ± 0.5 | LTR | |||
| 5 | 1 | 0 | - | - | - | - | LTR | |||
| 6 | 1 | 0 | - | - | - | - | MITE | |||
| 7 | 1 | 0 | - | - | - | - | MITE | |||
| 8 | 3 | 0 | 10.8 ± 5.4 | 0.83 ± 0.59 | LTR | |||||
| 9 | 3 | 0 | - | > 41 ± 6 | - | > 3.2 ± 0.6 | LTR | |||
| 10 | 3 | 1 | 13.1 ± 5.4 | 1.01 ± 0.58 | LTR | |||||
| 11 | 3 | 1 | 41.4 ± 5.6 | 3.18 ± 0.60 | LTR | |||||
| 12 | 6 | 0 | - | > 31 ± 4 | - | > 2.4 ± 0.4 | LTR | |||
| 13 | 1 | 1 | - | - | LTR | |||||
| 4 | 1 | 30.7 ± 3.5 | 2.36 ± 0.37 | LTR | ||||||
| 14 | 2 | 2 | - | < 31 ± 4 | - | < 2.4 ± 0.4 | LTR | |||
| 15 | 3 | 2 | 19.9 ± 3.4 | 1.53 ± 0.37 | LTR | |||||
| 16 | 2 | 0 | - | - | - | - | LTR | |||
| 17 | 3 | 0 | 57.0 ± 6.0 | 4.38 ± 0.64 | LTR | |||||
| 18 | 3 | 0 | > 39 ± 5 | > 3.0 ± 0.6 | LTR | |||||
| 3 | 0 | - | - | LTR | ||||||
| 19 | 5 | 1 | 39.1 ± 5.4 | 3.01 ± 0.58 | LTR | |||||
| 20 | 4 | 2 | 35.9 ± 4.9 | 2.76 ± 0.52 | LTR | |||||
| 21 | 3 | 3 | 31.6 ± 4.8 | 2.43 ± 0.52 | LTR | |||||
| 22 | 1 | 2 | - | < 39 ± 5 | - | < 3.0 ± 0.6 | LINE | |||
| 23 | 3 | 1 | - | - | - | - | LTR | |||
| 24 | 3 | 0 | 73.1 ± 18 | 5.62 ± 1.87 | LTR | |||||
| 25 | 1 | 0 | - | - | - | - | MITE | |||
| 26 | 1 | 0 | - | - | - | - | MITE | |||
| 27 | 2 | 0 | - | - | - | - | LTR | |||
| 28 | 2 | 1† | - | - | - | - | LTR | |||
Manual annotation results of maize [45] and diploid wheat [51] sequences are shown in italics. REANNOTATE results are shown in regular font style. Only elements spanning sequences that were annotated as TEs both in the manual annotation and in the input (REPEATMASKER) to the automated re-annotation are listed. In the first column letters indicate maize elements and correspond to labels in Figure 3C, numbers indicate wheat elements and labels in Figure 4.
Uppercase names correspond to reference element sequences in REPBASE UPDATE (RU), numbers correspond to reference sequences in the TIGR ZEA REPEAT DATABASE. Rows without an entry for the manually annotated repeat name indicate that REANNOTATE constructed multiple models (one model per row) corresponding to a single element in the manual annotation: for instance, Sabrina_F2-2 corresponds to three automated models, a result due to the fact that (parts of) different RU reference elements, SABRINA2_TM, SABRINA3_TM and SABRINA_HV are closely related, and were best matches (annotated by REPEATMASKER) to different segments of Sabrina_F2-2.
Number of similarity hits reported by REPEATMASKER that were defragmented into a single element model by REANNOTATE.
Number of repetitive elements nesting a given element. (†) The first wheat element listed was annotated in [51] to be inserted into a TE sequence with no detectable similarity to reference elements in RU (absent from the input REPEATMASKER annotation); the last wheat element was annotated by REANNOTATE to be interrupting a fragment of an element homologous to CLAUDIA1_TM, which is not present in the manual annotation.
Estimated number of nucleotide substitutions between intra-element LTRs. (*) REANNOTATE did not date Milt because the 3' LTR is in inverse orientation relative to the rest of the element: an element model was built including the Milt 5' LTR and internal region, and another model for the 3' LTR. (‡) Eway G1-1 was originally annotated as having identical LTRs, but they are in fact quite divergent. (...) Elements Ji-6, Tekay and Kake-1 were dated in the original annotation, but these elements are truncated at the ends of the available 160 Kb of contiguous sequence re-annotated here.
Estimated time of insertion (million years ago), obtained with the substitution rate for the adh loci of grasses [66]. The standard deviations computed by REANNOTATE are larger than in the manual annotation: in the latter the variance in time was propagated from the variance in K, whilst REANNOTATE additionally accounts for the Poisson variance (stochasticity) in the accumulation of nucleotide substitutions.
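The conversion from divergence to insertion time can be sketched as follows. The two LTRs of an element are identical at insertion and then diverge independently, so K substitutions per site accumulate over 2T years of evolution and T = K / (2r). The rate value below is an assumption: it is not stated in this text, but 6.5 × 10^-9 substitutions per site per year reproduces the paired values listed in the table if the first of each pair is read as K per kilobase (for example 18.2 and 1.40). The function name is hypothetical.

```python
def insertion_time_mya(k_per_site, rate_per_site_per_year=6.5e-9):
    """Insertion time in million years from LTR divergence K (per site).

    The default rate is an assumed value inferred from the table ratios,
    not a figure quoted in the text.
    """
    years = k_per_site / (2.0 * rate_per_site_per_year)
    return years / 1e6


# K is read here as substitutions per kilobase, so divide by 1000.
time_g = insertion_time_mya(18.2e-3)   # row 'g' of the table
time_24 = insertion_time_mya(73.1e-3)  # row '24' of the table
```

This reproduces only the point estimates. The standard deviations in the table cannot be obtained by the same scaling, because REANNOTATE also adds the Poisson variance described in the footnote above.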
Figure 3. Layers of re-annotation: maize adh1 locus. A. Representation of the input REPEATMASKER annotation of 160 Kb of sequence around the adh1 locus of maize cultivar LH82. In the bottom tier pale yellow boxes correspond to un-masked sequence (no similarity to known repeats), dark vertical lines indicate low-complexity repeats. Boxes in top tier represent similarity hits to dispersed repeats, separated by vertical lines. Red boxes represent hits to the IR of LTR-elements, orange boxes LTRs, lilac boxes non-LTR retrotransposons, dark green boxes DNA transposons, and the pink box an unknown type of repeat. B. Detail of re-annotation layer 1: Defragmentation. The two bottom tiers represent a portion of the REPEATMASKER annotation shown in A, whilst boxes above indicate 3 IR hits (third tier from bottom) and 3 LTR hits (top tier) defragmented into a single model, and therefore inferred to share an insertion event. This element is labeled 'i' in C and in Table 1, and corresponds to element Victim in [45]. (Three hits modeled as part of the same IR are linked by a red line, two hits modeled as part of the second LTR linked by an orange line). C. Re-annotation layer 2: Nesting Structure. Overlapping element models are shown in their order of insertion as resolved by REANNOTATE. (Figure created by rendering in APOLLO GFF annotation automatically generated by REANNOTATE). Letters label LTR-elements in Table 1 that were annotated in [45]. D. Re-annotation layer 3: Time. For 'complete' LTR-elements the number of substitutions (vertical scale) between intra-element LTRs and the time since insertion have been automatically estimated. Upper bounds on the ages of two solo LTRs (labeled 'j' and 'm') could be placed as the elements are inserted into complete LTR-elements. Double-headed arrows span two standard deviations around estimates of K. This figure may be compared to Figures 1 and 2 in [45].
Figure 2. Sequence output. The element model shown in green (in A) defragmented three regions of human chromosome Y homologous to segments of the reference sequence of the DNA transposon family MER58B (CHESHIRE_B), shown in B. Hits marked 1 and 2 (in A) are separated on the chromosome by only 26 bp, but in the output model sequence (shown in C) their respective sequences are separated by an internal gap of length 79: this is the distance along the reference MER58B separating its segments that are aligned to hits 1 and 2. In contrast, the sequences of hits 2 and 3 are output contiguously (without an intervening gap) because they match contiguous segments of MER58B, even though the corresponding chromosomal regions are separated by an ALUSX SINE insertion (blue box above the green model in A). The terminal gap in the model sequence is added to indicate that the annotated alignment of hit 3 ends five nucleotide positions short of the 5' terminus of the MER58B sequence.
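The gap rule described in this caption can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not REANNOTATE's code: the function name and the `(ref_start, ref_end, seq)` tuple layout are hypothetical, hits are assumed to be already oriented and sorted along the reference, and padding both termini (not only the one mentioned in the caption) is an assumption. Gaps are sized by distance along the reference, never by distance on the chromosome.

```python
def assemble_model_sequence(hits, ref_length, gap_char="-"):
    """Concatenate hit sequences, padding with gaps measured on the reference.

    hits: list of (ref_start, ref_end, seq) with 1-based inclusive reference
    coordinates, sorted by ref_start (hypothetical input layout).
    """
    parts = []
    next_ref_pos = 1  # first reference position not yet covered
    for ref_start, ref_end, seq in hits:
        gap = ref_start - next_ref_pos
        if gap > 0:
            parts.append(gap_char * gap)  # internal or leading gap
        parts.append(seq)
        next_ref_pos = max(next_ref_pos, ref_end + 1)
    tail = ref_length - next_ref_pos + 1
    if tail > 0:
        parts.append(gap_char * tail)  # terminal gap
    return "".join(parts)


# Toy data: two hits on contiguous reference segments, then a 4-position gap.
toy_model = assemble_model_sequence(
    [(6, 10, "AAAAA"), (11, 15, "CCCCC"), (20, 22, "GGG")], ref_length=22
)
```

With the toy input, the first two hits are emitted back to back because they cover adjacent reference positions, regardless of how far apart they might sit on the chromosome.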
Figure 4. Re-annotation of highly nested TEs in the diploid wheat genome. Re-annotation of repeats in a 215 Kb region of Triticum monococcum chromosome 5A. Numbers label elements listed in Table 1. Colour scheme follows Figure 3 (except that thin green boxes are specifically MITEs, and on the bottom tier unique genes are shown as light blue boxes). Label "18" appears twice and corresponds to element Sabrina_G1-1 annotated in [51] (Fig. 1 in this reference may be compared with this figure); REANNOTATE constructed two separate models because when run with default parameters the maximum span of a model is 40 Kb, which is exceeded by the chromosomal span of Sabrina_G1-1. Element models marked with a "*" above a horizontal bar were annotated as part of element Sabrina_F2-2 in [51], which corresponds to label "3" in this figure (see section 'Limitations and scope for development'). This figure was rendered in APOLLO from a GFF annotation file generated by REANNOTATE.
Accuracy of REANNOTATE's inferences relative to manual annotation
| | Defragmentation sensitivity | Defragmentation specificity | Nesting | Time |
| maize (AF123535.1) | 97.8% | 100.0% | 100.0% | 100.0% |
| wheat (AF459639.1) | 96.0% | 100.0% | 93.3% | 90.9% |
Accuracy of predictions in the Defragmentation layer of re-annotation is given by their sensitivity and specificity according to the formulas $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$ and $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$ respectively, where TP is the count of true positives, TN true negatives, FP false positives, and FN the count of false negatives (see Implementation) relative to the manual annotations. Here a 'prediction' refers to a sequence similarity hit reported by REPEATMASKER that has been defragmented into a TE model by REANNOTATE. For the Nesting Structure and Time layers accuracy is given as the proportion of predictions in agreement with the original manual annotation [45,51].
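As a small worked illustration of these two measures, the sketch below uses the standard definitions of sensitivity and specificity. The function names and the counts in the example are hypothetical and are not taken from the paper.

```python
def sensitivity(tp, fn):
    """Fraction of hits that belong in a model and were placed there."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of hits that do not belong in a model and were left out."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
example_sensitivity = sensitivity(tp=9, fn=1)
example_specificity = specificity(tn=20, fp=0)
```

A specificity of 100%, as reported for both sequences in the table, corresponds to zero false positives under this definition.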
Figure 5. Segmental multiplication within a TE cluster in the fly genome. Re-annotation of a cluster of repeats in the Drosophila melanogaster chromosome arm 3R. The scale shows chromosomal coordinates (Release 3.1 genome sequence). Visualisation scheme as in Figure 3, except that "element" models (displayed as boxes united by horizontal lines) no longer indicate sequences sharing an insertion (transposition) event; here a model indicates sequences that resulted from segmental multiplication subsequent to an original insertion. Note the high periodicity of the arrangement. The LTR-elements displayed immediately above the bottom tier all belong to the COPIA2 family, the sequences marked with a '*' are all INVADER1 LTRs, and the ones marked with a black bar are MICROPIA elements. REANNOTATE infers that the COPIA2 sequences have been involved in DNA rearrangements other than transposition of an entire element. It is likely that the INVADER1 LTR was inserted in a COPIA2 LTR prior to the multiplication of the latter. The green box to the left indicates (subsequent) insertion of a PROTOP_A element, and the ones on the right S-elements. (All family names given as in the RU database). This figure was rendered in APOLLO from a GFF annotation file generated by REANNOTATE, and it may be compared with Figure 2 in ref. [32].
Nesting of repeats in human chromosomes Y and 2
| | no. of elements (Y) | no. of elements (2) | % chromosome (Y) | % chromosome (2) | % nested (Y) | % nested (2) | insertions/Kb (Y) | insertions/Kb (2) |
| SINE | 10068 | 119383 | 10% | 12% | 46% | 51% | 0.48 | 0.63 |
| LINE | 6207 | 65661 | 25% | 21% | 32% | 39% | 0.81 | 1.66 |
| LTR | 5200 | 38395 | 18% | 8% | 43% | 52% | 0.57 | 0.68 |
| satellite | 2144 | 222 | 5% | 0% | 4% | 37% | 0.05 | 0.18 |
| DNA | 1464 | 26886 | 2% | 3% | 35% | 46% | 1.12 | 0.88 |
| RNA | 55 | 778 | 0% | 0% | 54% | 56% | 0.36 | 0.26 |
| repeats total | 25157 | 251667 | 59% | 45% | 38% | 48% | 0.65 | 1.19 |
| non-repetitive‡ | - | - | 41% | 55% | - | - | 1.33† | 1.00 |
Number of repetitive elements (models) in each group. Percentage of available (cytologically euchromatic) chromosome sequence. Percentage of elements in each group that are inserted into any other repetitive element. Density of repetitive element insertions (of any kind) per kilo base pairs of sequence within each group. ‡Non-repetitive sequence here means that similarity to known families of high complexity repeats was not detected, but it includes low-complexity repeats. †Excludes satellite repeats arrayed in tandem. (Satellites considered are high-complexity repeats).
Figure 6. Age distribution of endogenous retroviruses in the human genome. Age distributions of endogenous retroviruses (ERVs) from the automated re-annotation of three human chromosomes. ERV intra-element LTRs on chromosome Y are significantly more divergent than those on chromosomes 2 and 1. Top axis shows the number of substitutions per kilo base pairs of intra-element LTR alignments. Age estimates obtained with a rate of 2.1 × 10^-3 substitutions per site per million years [57].