Timothée Flutre, Elodie Duprat, Catherine Feuillet, Hadi Quesneville.
Abstract
Transposable elements (TEs) are mobile, repetitive DNA sequences that are almost ubiquitous in prokaryotic and eukaryotic genomes. They have a large impact on genome structure, function and evolution. With the recent development of high-throughput sequencing methods, many genome sequences have become available, making possible comparative studies of TE dynamics at an unprecedented scale. Several methods have been proposed for the de novo identification of TEs in sequenced genomes. Most begin with the detection of genomic repeats, but the subsequent steps for defining TE families differ. High-quality TE annotations are available for the Drosophila melanogaster and Arabidopsis thaliana genome sequences, providing a solid basis for the benchmarking of such methods. We compared the performance of specific algorithms for the clustering of interspersed repeats and found that only a particular combination of algorithms detected TE families with good recovery of the reference sequences. We then applied a new procedure for reconciling the different clustering results and classifying TE sequences. The whole approach was implemented in a pipeline using the REPET package. Finally, we show that our combined approach highlights the dynamics of well-defined TE families by making it possible to identify structural variations among their copies. This approach makes it possible to annotate TE families and to study their diversification in a single analysis, improving our understanding of TE dynamics at the whole-genome scale and for diverse species.
Year: 2011 PMID: 21304975 PMCID: PMC3031573 DOI: 10.1371/journal.pone.0016526
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Sensitivity and specificity of the programs tested in the three-step de novo approach.
| Genome | Self-alignment | Clustering | Multiple alignment | Sn* | Sp* | RCC |
| D. mel. | BLASTER | GROUPER | MAP | 80.34% | 85.89% | 66.20% |
| D. mel. | BLASTER | RECON | MAP | 92.31% | 73.17% | 66.20% |
| D. mel. | BLASTER | PILER | MAP | 62.39% | 84.17% | 51.50% |
| D. mel. | PALS | GROUPER | MAP | 73.50% | 88.75% | 60.30% |
| D. mel. | PALS | RECON | MAP | 90.60% | 74.23% | 51.50% |
| D. mel. | PALS | PILER | MAP | 53.85% | 76.42% | 42.64% |
| A. tha. | BLASTER | GROUPER | MAP | 60.33% | 82.42% | 39.00% |
| A. tha. | BLASTER | RECON | MAP | 73.77% | 61.70% | 43.50% |
| A. tha. | BLASTER | PILER | MAP | 47.21% | 57.33% | 32.45% |
| A. tha. | PALS | GROUPER | MAP | 54.75% | 88.38% | 24.00% |
| A. tha. | PALS | RECON | MAP | 71.80% | 66.20% | 27.90% |
| A. tha. | PALS | PILER | MAP | 40.00% | 59.92% | 16.20% |
“D. mel.” stands for “D. melanogaster” and “A. tha.” stands for “A. thaliana”. The three indices Sn*, Sp* and RCC correspond respectively to the measure of sensitivity, the measure of specificity and the recovery ratio when comparing a databank of TE de novo consensus sequences with a databank of TE reference sequences.
Figure 1. Schematic diagram of the dynamics of a TE family with two structural variants.
Figure 2. Venn diagram showing the gains achieved by combining several clustering programs.
(A) Combining the GROUPER and RECON programs in particular makes it possible to fully recover more TE sequences than each program alone from the D. melanogaster genome. (B) Same conclusion from the A. thaliana genome.
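The Venn-diagram comparison above can be illustrated with plain set operations. The sketch below is only a toy illustration, not the REPET implementation: the TE names and the two input sets are hypothetical, and "fully recovered" is assumed to have been decided beforehand for each reference sequence.

```python
# Toy illustration of the Venn-diagram counts; all TE names are hypothetical.
from typing import Dict, Set


def venn_counts(a: Set[str], b: Set[str]) -> Dict[str, int]:
    """Count the regions of a two-set Venn diagram."""
    return {
        "only_a": len(a - b),   # recovered by the first program only
        "both": len(a & b),     # recovered by both programs
        "only_b": len(b - a),   # recovered by the second program only
        "union": len(a | b),    # recovered when the two programs are combined
    }


# Hypothetical sets of reference TEs fully recovered by each program.
grouper_recovered = {"TE_a", "TE_b", "TE_c"}
recon_recovered = {"TE_b", "TE_c", "TE_d", "TE_e"}

counts = venn_counts(grouper_recovered, recon_recovered)
print(counts)
```

In this toy case the combined set is larger than either input set, which is the kind of gain the figure reports for GROUPER and RECON.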
Figure 3. Simplified decision tree implemented in the TE classifier.
TEclassifier results for the classification of D. melanogaster TE sequences.
| Classification | Reference TEs from the BDGP | | |
| Class I “complete” LTR retrotransposon | 56 | 150 | 48 |
| Class I “incomplete” LTR retrotransposon | 2 | 377 | 209 |
| Class I “complete” LINE | 23 | 117 | 27 |
| Class I “incomplete” LINE | 17 | 147 | 57 |
| Class I SINE | 0 | 2 | 1 |
| Class II “complete” DNA transposon | 19 | 30 | 13 |
| Class II “incomplete” DNA transposon | 2 | 75 | 32 |
| Class II MITE | 0 | 8 | 5 |
| Helitron | 1 | 0 | 0 |
| SSR | 0 | 8 | 8 |
| Host genes | 0 | 26 | 11 |
| Confused | 1 | 20 | 6 |
| No category | 5 | 341 | 176 |
| Total | 126 | 1301 | 593 |
TE annotation results obtained with reference databanks and de novo databanks.
| Genome | TE databank | Consensus (having copies) | TE genome coverage | Number of copies | Sn | Sp |
| D. mel. | BDGP | 125 | 10.51% | 31208 | NA | NA |
| D. mel. | GROUPER | 712 | 10.29% | 43699 | 81.92% | 98.12% |
| D. mel. | RECON | 437 | 11.05% | 33072 | 87.77% | 97.95% |
| D. mel. | PILER | 114 | 8.87% | 32789 | 74.07% | 98.79% |
| D. mel. | RepeatScout | 1432 | 10.86% | 42048 | 85.28% | 97.88% |
| D. mel. | G+R+P | 568 | 11.98% | 42847 | 91.43% | 97.35% |
| A. tha. | Repbase | 318 | 19.02% | 41146 | NA | NA |
| A. tha. | GROUPER | 1237 | 18.78% | 41791 | 79.29% | 95.43% |
| A. tha. | RECON | 1004 | 23.69% | 49470 | 88.75% | 91.59% |
| A. tha. | PILER | 300 | 13.14% | 34818 | 56.56% | 97.05% |
| A. tha. | RepeatScout | 2893 | 21.95% | 68958 | 82.91% | 92.36% |
| A. tha. | G+R+P | 1232 | 22.77% | 44059 | 87.03% | 92.32% |
“D. mel.” stands for “D. melanogaster” and “A. tha.” stands for “A. thaliana”. The Sn and Sp columns correspond respectively to sensitivity and specificity results when comparing two annotations in terms of nucleotide overlaps. “G+R+P” indicates that the three programs GROUPER, RECON and PILER were used to build the databank of de novo consensus sequences.
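As a rough illustration of comparing two annotations by nucleotide overlap, the sketch below computes sensitivity and specificity on a toy genome. It assumes the standard definitions Sn = TP/(TP+FN) and Sp = TN/(TN+FP), counted per nucleotide; the record above does not spell out the formulas, so this is an assumption, and the function names, coordinates and genome length are hypothetical.

```python
# Minimal sketch: per-nucleotide sensitivity and specificity between a
# reference annotation and a predicted annotation.
# Assumption: Sn = TP/(TP+FN) and Sp = TN/(TN+FP), counted per nucleotide.
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based coordinates


def covered(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of nucleotide positions covered by the intervals."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def overlap_sn_sp(reference: Iterable[Interval],
                  predicted: Iterable[Interval],
                  genome_length: int) -> Tuple[float, float]:
    """Compute (Sn, Sp) from nucleotide overlaps between two annotations."""
    ref = covered(reference)
    pred = covered(predicted)
    tp = len(ref & pred)                 # annotated in both
    fn = len(ref - pred)                 # in the reference only
    fp = len(pred - ref)                 # in the prediction only
    tn = genome_length - tp - fn - fp    # annotated in neither
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return sn, sp


# Hypothetical toy annotations on a 100-nucleotide genome.
reference = [(10, 30), (50, 60)]
predicted = [(15, 35), (50, 55)]
sn, sp = overlap_sn_sp(reference, predicted, genome_length=100)
print(f"Sn = {sn:.2%}, Sp = {sp:.2%}")
```

Sets of positions keep the sketch short; at whole-genome scale, sorted interval arithmetic would be needed instead. Because TEs cover a minority of each genome here, true negatives dominate under this definition, so Sp stays high even when Sn drops.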
Figure 4. Extensive structural variations within several TE families.
Each image provides an overview of a multiple alignment, a column being in one color if all the residues within it are identical. In all the images, the first sequence in the multiple alignment (red star) is the TE reference sequence from a public databank (BDGP or Repbase). For alignments A to D, the second sequence (blue star) is the only de novo consensus in which the TE reference sequence is fully recovered, and it was built by only one clustering method. All sequences below (in brackets) are TE genomic copies found by the de novo consensus analysis. For alignments E to H, the sequences below the TE reference sequence are de novo consensus sequences that require manual curation. The program that built each consensus is indicated beside it: “R” for RECON and “G” for GROUPER.