Cédric Feschotte, Umeshkumar Keswani, Nirmal Ranganathan, Marcel L Guibotsy, David Levine.
Abstract
Eukaryotic genomes contain large amounts of repetitive DNA, most of which is derived from transposable elements (TEs). Progress has been made in developing computational tools for ab initio identification of repeat families, but there is an urgent need for tools that automate the annotation of TEs in genome sequences. Here we introduce REPCLASS, a tool that automates the classification of TE sequences. Using control repeat libraries, we show that the program can accurately classify virtually any known TE type. Combining REPCLASS with ab initio repeat finding in the genomes of Caenorhabditis elegans and Drosophila melanogaster allowed us to recover the contrasting TE landscapes characteristic of these species. Unexpectedly, REPCLASS also uncovered several novel TE families in both genomes, augmenting the TE repertoire of these model species. When applied to the genomes of distant Caenorhabditis and Drosophila species, the approach revealed a remarkable conservation of TE composition profile within each genus, despite substantial interspecific variation in genome size and in the number of TEs and TE families. Lastly, we applied REPCLASS to analyze 10 fungal genomes from a wide taxonomic range, most of which had not previously been analyzed for TE content. The results showed that TE diversity varies widely across the fungal "kingdom" and appears to correlate positively with genome size, in particular for DNA transposons. Together, these data validate REPCLASS as a powerful tool to explore the repetitive DNA landscapes of eukaryotes and to shed light on the evolutionary forces shaping TE diversity and genome architecture.
Keywords: genome annotation; repeat classification; repetitive elements; transposable elements; transposons
Year: 2009 PMID: 20333191 PMCID: PMC2817418 DOI: 10.1093/gbe/evp023
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Overview of the REPCLASS workflow. Subroutines are shown in italics in black boxes. Databases are shown in gray cylinders. Each input query sequence (typically a consensus) is analyzed by the three classification modules of REPCLASS. HOM: homology-based; searches for similarity to known repeats deposited in Repbase using TBlastX and extracts the classification from a keyword index file. STR: structure-based; several subroutines search for structural features characteristic of different groups of TEs, such as terminal inverted repeats (TIR_search), LTRs (LTR_search), tRNA-like sequences (tRNAscan-SE), or polyA/SSRs (polyA/SSR_search). TSD: target site duplication; individual copies are extracted from the target genome sequence using BlastN and their flanking sequences are searched for TSDs. If no TSDs are found, the subroutine Helitron_scan is executed to look for structural features of Helitrons. The final step attempts to compare and integrate the results of the three modules, resulting in a tentative classification for each input sequence. For a complete description of the workflow and subroutines, see Results and Methods.
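The structure-based module looks for features such as terminal inverted repeats (TIRs). This record does not describe how TIR_search is implemented, so the following is only a minimal sketch under stated assumptions: exact matching, no gaps, and hypothetical helper names (`reverse_complement`, `tir_length`). It relies on the fact that a sequence flanked by a TIR starts with the reverse complement of its own end, so comparing the sequence with its reverse complement from the first base measures the TIR length.

```python
# Illustrative sketch only; this is NOT the REPCLASS TIR_search subroutine.
# Function names and the exact-match rule are assumptions made for this example.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def tir_length(seq: str) -> int:
    """Length of the exact-match terminal inverted repeat of seq (0 if none)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    half = len(seq) // 2  # the two repeat arms cannot overlap
    length = 0
    for left, right in zip(seq[:half], rc[:half]):
        # Stop at the first mismatch; the ambiguous base N never counts as a match.
        if left != right or left == "N":
            break
        length += 1
    return length


if __name__ == "__main__":
    # A 7-bp inverted repeat (CAGTGGG ... CCCACTG) around a 5-bp spacer.
    element = "CAGTGGG" + "AAAAA" + "CCCACTG"
    print(tir_length(element))  # prints 7
```

A practical search would presumably tolerate mismatches and require a minimum repeat length, since a one- or two-base match at the termini can arise by chance.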
Validation of REPCLASS with Repbase libraries. Venn diagrams showing the number of consensus sequences in the Repbase Update (RU) library of (A) C. elegans (n = 116) and (B) D. melanogaster (n = 144) classified by the different modules of REPCLASS.
TE composition profiles generated by REPCLASS for (A) three Caenorhabditis species and (B) three Drosophila species. The profile depicts the percentage of families falling within one of the four TE subclasses (LTR retrotransposons, non-LTR retrotransposons, cut-and-paste DNA transposons, and Helitrons).
Genome Statistics and Annotation of TEs in Caenorhabditis and Drosophila Species
| DNA analyzed (Mb) | 100.3 | 138.4 | 170.4 |
| WGS coverage | n/a | 9.2X | 9.5X |
| Number of contigs | Chromosomes | 12,680 | 13,589 |
| Average contig length (bp) | n/a | 10,915 | 12,545 |
| Number of families identified by RepeatScout | 445 | 1,368 | 1,477 |
| Number of families classified by REPCLASS | 146 | 331 | 362 |
| TE subclass | Number of families | Average consensus length (bp) | Number of families | Average consensus length (bp) | Number of families | Average consensus length (bp) |
| DNA | 107 | 649 | 212 | 654 | 254 | 558 |
| Helitron | 5 | 891 | 25 | 674 | 20 | 1,038 |
| LTR | 11 | 1,403 | 28 | 1,070 | 21 | 1,073 |
| Non-LTR | 23 | 707 | 66 | 788 | 67 | 478 |
| DNA analyzed (Mb) | 137.7 | 146 | 189.2 |
| WGS coverage | n/a | 9.1X | 8.0X |
| Number of contigs | Chromosomes | 4,896 | 13,530 |
| Average contig length (bp) | n/a | 29,832 | 13,984 |
| Number of families identified by RepeatScout | 810 | 1,673 | 1,743 |
| Number of families classified by REPCLASS | 464 | 855 | 868 |
| TE subclass | Number of families | Average consensus length (bp) | Number of families | Average consensus length (bp) | Number of families | Average consensus length (bp) |
| DNA | 63 | 508 | 127 | 330 | 83 | 519 |
| Helitron | 11 | 433 | 29 | 444 | 142 | 619 |
| LTR | 218 | 1,411 | 415 | 766 | 424 | 1,222 |
| Non-LTR | 172 | 906 | 284 | 519 | 219 | 1,077 |
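The composition profiles described in the figure caption are percentages of classified families per TE subclass, which can be recomputed from the family counts in this table. The sketch below makes two assumptions that are not stated in the extracted table: the first block of rows belongs to the three Caenorhabditis species and the second to the three Drosophila species (following the order in the table title), and the row with no label is the Helitron subclass (the fourth subclass named in the caption). The function names are hypothetical.

```python
# Family counts per TE subclass, copied from the table above.
# Each list holds the three species columns, left to right.
# Assumptions: block order follows the table title, and the unlabeled
# row in each block is the Helitron subclass.
CAENORHABDITIS = {
    "DNA": [107, 212, 254],
    "Helitron": [5, 25, 20],
    "LTR": [11, 28, 21],
    "Non-LTR": [23, 66, 67],
}
DROSOPHILA = {
    "DNA": [63, 127, 83],
    "Helitron": [11, 29, 142],
    "LTR": [218, 415, 424],
    "Non-LTR": [172, 284, 219],
}


def column_totals(counts):
    """Total number of classified families for each species column."""
    return [sum(column) for column in zip(*counts.values())]


def composition_profile(counts):
    """Return one {subclass: percent of families} dict per species column."""
    totals = column_totals(counts)
    return [
        {name: 100.0 * values[i] / total for name, values in counts.items()}
        for i, total in enumerate(totals)
    ]


if __name__ == "__main__":
    for genus, counts in (("Caenorhabditis", CAENORHABDITIS), ("Drosophila", DROSOPHILA)):
        for column, profile in enumerate(composition_profile(counts), start=1):
            shares = ", ".join(f"{name} {pct:.1f}%" for name, pct in profile.items())
            print(f"{genus} column {column}: {shares}")
```

Under these assumptions the column totals reproduce the "Number of families classified by REPCLASS" row in each block, which supports reading the unlabeled row as the fourth subclass.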
Genome Statistics and Annotation of TEs of 10 Fungal Species
| Species | Phylum | Genome Size (Mb) | WGS Coverage | Number of Contigs | Average Contig Length (kb) | Number of Repeat Families Identified | Number of TE Families Classified |
| | Ascomycete | 14.3 | 10.9X | 8 | 1,787.2 | 37 | 7 |
| | Basidiomycete | 19.7 | 10X | 274 | 71.8 | 25 | 4 |
| | Ascomycete | 29.4 | 10.5X | 8 | 3,673.1 | 31 | 23 |
| | Ascomycete | 30 | 13X | 248 | 121.2 | 49 | 35 |
| | Ascomycete | 32.5 | 11X | 976 | 33.3 | 124 | 38 |
| | Ascomycete | 34.3 | 7X | 1,245 | 27.6 | 176 | 70 |
| | Basidiomycete | 36.2 | 10X | 431 | 84.1 | 178 | 48 |
| | Zygomycete | 45.3 | 12X | 389 | 116.3 | 496 | 127 |
| | Ascomycete | 59.9 | 6.8X | 1,362 | 44.0 | 516 | 204 |
| | Basidiomycete | 81.5 | 7.8X | 4,557 | 17.9 | 2,085 | 430 |
Relationship between genome size and the number of TE families classified by REPCLASS in 10 fungal genomes.
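The relationship between genome size and the number of classified TE families can be checked against the fungal table. This record does not say which statistic the authors used, so the sketch below is only one way to quantify it: a Pearson coefficient on the raw values and a Spearman coefficient computed as the Pearson coefficient of the ranks. All function and variable names are hypothetical, and the rank helper assumes no tied values, which holds for these 10 genomes.

```python
from math import sqrt

# Values copied from the fungal table, ordered by increasing genome size.
GENOME_SIZE_MB = [14.3, 19.7, 29.4, 30.0, 32.5, 34.3, 36.2, 45.3, 59.9, 81.5]
TE_FAMILIES_CLASSIFIED = [7, 4, 23, 35, 38, 70, 48, 127, 204, 430]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    print(f"Pearson r    = {pearson(GENOME_SIZE_MB, TE_FAMILIES_CLASSIFIED):.3f}")
    print(f"Spearman rho = {spearman(GENOME_SIZE_MB, TE_FAMILIES_CLASSIFIED):.3f}")
```

The rank-based coefficient is less sensitive to the largest genome (81.5 Mb, 430 families), which would otherwise dominate a linear fit on 10 points.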
TE composition profiles generated by REPCLASS for 10 fungal genomes. The species are ranked by increasing genome size from left to right. For taxonomic information, see table 2 and supplementary table 1 (Supplementary Material online), and for phylogenetic relationships, see Fitzpatrick et al. (2006).
Data Summary on REPCLASS False-Positive Rate
| Data Set | Number of False Positives (a) | Number of TEs Examined | False-Positive Rate |
| Control libraries (b) | 10 | 247 | 0.04 |
| Ab initio: “known families” (c) | 14 | 150 | 0.09 |
| Ab initio: “new families” (d) | 2 | 100 | 0.02 |
| Total | 26 | 497 | 0.05 |
For each data set, the results obtained for C. elegans and D. melanogaster were combined.
(a) Number of TE families misclassified by REPCLASS, based on comparison of the classification given by REPCLASS to the one provided by Repbase (b and c) or by “manual” inspection (d).
(b) See section “Validation with reference repeat libraries.”
(c) See section “TE annotation by combining REPCLASS with ab initio repeat finding.”
(d) See section “REPCLASS-assisted discovery of new TE families in C. elegans and D. melanogaster.”
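The false-positive rates in the table follow from dividing the number of misclassified families by the number of families examined. The short sketch below, with hypothetical names, recomputes each row and the pooled total from the counts in the table; it assumes the "Total" rate is pooled over all 497 families rather than averaged across the three data sets.

```python
# Counts copied from the false-positive table: (false positives, TEs examined).
DATA_SETS = {
    "Control libraries": (10, 247),
    "Ab initio, known families": (14, 150),
    "Ab initio, new families": (2, 100),
}


def false_positive_rate(false_positives: int, examined: int) -> float:
    """Fraction of examined TE families that were misclassified."""
    return false_positives / examined


def summarize(data_sets):
    """Per-data-set rates plus a pooled total over all examined families."""
    rates = {
        name: false_positive_rate(fp, examined)
        for name, (fp, examined) in data_sets.items()
    }
    total_fp = sum(fp for fp, _ in data_sets.values())
    total_examined = sum(examined for _, examined in data_sets.values())
    rates["Total"] = false_positive_rate(total_fp, total_examined)
    return rates


if __name__ == "__main__":
    for name, rate in summarize(DATA_SETS).items():
        print(f"{name}: {rate:.2f}")
```

Rounded to two decimals, these values agree with the rates reported in the table.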