Lukas Weilguny, Robert Kofler.
Abstract
Transposable elements (TEs) are selfish DNA sequences that multiply within host genomes. They are present in most species investigated so far, at varying degrees of abundance and sequence diversity. TE composition may vary not only between but also within species, and could have important biological implications. Variation in prevalence among populations may, for example, indicate a recent TE invasion, whereas sequence variation could indicate the presence of hyperactive or inactive forms. Gaining unbiased estimates of TE composition is thus vital for understanding the evolutionary dynamics of transposons. To this end, we developed DeviaTE, a tool to analyse and visualize TE abundance using Illumina or Sanger sequencing reads. Our tool requires sequencing reads of one or more samples (tissue, individual or population) and consensus sequences of TEs. It generates a table and a visual representation of TE composition. This allows for an intuitive assessment of coverage, sequence divergence, segregating SNPs and indels, as well as the presence of internal and terminal deletions. By contrasting the coverage between TEs and single copy genes, DeviaTE derives unbiased estimates of TE abundance. We show that naive approaches, which do not consider regions spanned by internal deletions, may substantially underestimate TE abundance. Using published data, we demonstrate that DeviaTE can be used to study the TE composition within samples, identify clinal variation in TEs, compare TE diversity among species, and monitor TE invasions. Finally, we present careful validations with publicly available and simulated data. DeviaTE is implemented in Python and distributed under the GPLv3 (https://github.com/W-L/deviaTE).
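The coverage contrast described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DeviaTE's implementation: the function names (`add_deletion_coverage`, `insertions_per_haploid_genome`), the input layout (a per-position coverage list plus `(start, end, supporting_reads)` tuples for internal deletions) and all numbers are hypothetical. It compares a naive estimate with one that also counts the positions spanned by internal deletions.

```python
from statistics import mean


def add_deletion_coverage(coverage, deletions):
    """Return a copy of the coverage in which reads spanning an internal
    deletion are counted at every deleted position.

    Each deletion is a (start, end, supporting_reads) tuple describing the
    half-open interval [start, end). This layout is an assumption made for
    the example.
    """
    corrected = list(coverage)
    for start, end, supporting_reads in deletions:
        for pos in range(start, end):
            corrected[pos] += supporting_reads
    return corrected


def insertions_per_haploid_genome(te_coverage, gene_coverage):
    """Mean TE coverage relative to the mean coverage of a single copy gene."""
    return mean(te_coverage) / mean(gene_coverage)


# Hypothetical toy data: a 10 bp TE in which positions 3..6 are missing from
# an internally deleted variant that is supported by 30 reads.
te_coverage = [50, 50, 50, 20, 20, 20, 20, 50, 50, 50]
deletions = [(3, 7, 30)]
gene_coverage = [10] * 10  # single copy gene with uniform coverage

naive = insertions_per_haploid_genome(te_coverage, gene_coverage)
corrected = insertions_per_haploid_genome(
    add_deletion_coverage(te_coverage, deletions), gene_coverage
)
print(f"naive: {naive:.1f}, deletion-aware: {corrected:.1f}")
```

In this toy case the naive estimate is lower because copies carrying the internal deletion contribute no reads to the deleted region, which is the kind of underestimate the abstract refers to.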
Keywords: Python; assembly free; data visualization; divergence; mobile genetic element; transposon
Year: 2019 PMID: 31056858 PMCID: PMC6791034 DOI: 10.1111/1755-0998.13030
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 7.090
Figure 1. Example of the visualization of TE diversity with DeviaTE using burdock from D. melanogaster. Sequencing coverage is shown separately for unambiguously (dark grey) and ambiguously (light grey) mapped reads. Fixed differences and polymorphic sites are shown as coloured bars, with the height of the bar corresponding to the frequency of the SNP. The reference allele is not shown in the visualization. Internal deletions are displayed as arcs, where the width of the arcs scales with the abundance of the deletion. Terminal deletions are shown as dashed lines, with their opacity indicating the abundance of the deletion (darker lines indicate higher abundance). An annotation of the TE is shown at the bottom. Note that ambiguously mapped regions coincide with the long LTRs of burdock. Data are from a D. melanogaster line caught in the Netherlands (Grenier et al., 2015).
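The bar heights described in this caption correspond to per-site allele frequencies with the reference allele left out. A minimal sketch of that calculation follows; the function name `non_reference_frequencies` and the dictionary of base counts are hypothetical and do not reflect DeviaTE's internal data structures.

```python
def non_reference_frequencies(base_counts, reference_base):
    """Frequencies of all observed alleles except the reference at one site.

    base_counts maps each base to the number of reads supporting it
    (an assumed input layout for this example).
    """
    total = sum(base_counts.values())
    if total == 0:
        return {}  # no coverage, nothing to draw
    return {
        base: count / total
        for base, count in base_counts.items()
        if base != reference_base and count > 0
    }


# Hypothetical polymorphic site: the reference is A, 40% of reads carry G.
polymorphic = non_reference_frequencies({"A": 60, "C": 0, "G": 40, "T": 0}, "A")

# Hypothetical fixed difference: every read differs from the reference.
fixed = non_reference_frequencies({"A": 0, "C": 0, "G": 100, "T": 0}, "A")

print(polymorphic, fixed)
```

A site where all reads match the reference yields an empty result, so no bar would be drawn there.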
Figure 3. Validation of DeviaTE with simulated data. (a) Comparison between simulated and observed sequence divergence. DeviaTE accurately recovers simulated divergence of up to 15% for short reads (100 bp) and 22% for long reads (1,000 bp). Notably, the accuracy does not increase linearly with the read length. (b) Error of the estimated coverage dependent on the simulated divergence of reads. DeviaTE accurately reproduces the simulated coverage if the mismatch rate is smaller than 8% and 16% for short and long reads, respectively. Lower divergence levels are tolerated for indels. (c) Accuracy of allele frequency estimates dependent on the divergence. DeviaTE accurately reproduces allele frequencies of SNPs up to a divergence of 15%. (d) Accuracy of estimated frequencies of internal deletions. Since raw frequency estimates show a small bias (left), we implemented a read length dependent correction factor (right, inset), which substantially improves the accuracy of frequency estimates (right). Note that in a, c, and d a diagonal would indicate perfect agreement between expected and observed values.
Figure 2. An invasion of the P‐element in an experimental Drosophila simulans population visualized with DeviaTE (data from Kofler et al., 2018). We show the abundance and the diversity of the P‐element for four successive time points. The coverage was normalized to one million mapped reads and estimates of insertions per haploid genome were calculated by relating the total coverage of the P‐element to the coverage of the gene rpl32. Note that the abundance of P‐elements as well as the number of internally deleted variants increases during the invasion.
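The normalization mentioned in this caption can be illustrated with a short sketch. The helper name `normalize_per_million` and the sample values are hypothetical. The point is only that scaling raw coverage to one million mapped reads makes time points with different sequencing depths comparable in a plot, while the ratio of P‐element to rpl32 coverage within a sample is unaffected by the scaling.

```python
def normalize_per_million(coverage, mapped_reads):
    """Scale coverage values to a library of one million mapped reads."""
    factor = 1_000_000 / mapped_reads
    return [value * factor for value in coverage]


# Hypothetical samples:
# (mean P-element coverage, mean rpl32 coverage, number of mapped reads)
samples = {
    "early": (40.0, 20.0, 2_000_000),
    "late": (300.0, 10.0, 1_000_000),
}

ratios = {}
for name, (te_cov, gene_cov, mapped) in samples.items():
    te_norm, gene_norm = normalize_per_million([te_cov, gene_cov], mapped)
    # The TE to gene ratio is the same before and after normalization.
    ratios[name] = te_norm / gene_norm
    print(name, te_norm, gene_norm, ratios[name])
```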
Comparison of different tools for analyzing TE abundance. The required input, the resulting output, notable features and shortcomings are shown for each tool (RepeatMasker [Smit et al., 1996‐2010], RepeatExplorer [Novák et al., 2013], dnaPipeTE [Goubert et al., 2015], RepLong [Guo et al., 2017] and DeviaTE).
| | DeviaTE | RepeatMasker | RepeatExplorer | dnaPipeTE | RepLong |
|---|---|---|---|---|---|
| Method | Alignment of reads to TEs | Alignment of TEs to assembly | De novo assembly | De novo assembly | De novo assembly |
| Input | Sequencing reads, TE sequences | Genome assembly, TE sequences | Sequencing reads, TE sequences | Sequencing reads (single‐end only), TE sequences, genome size estimate | Sequencing reads, genome size estimate |
| Output | Variation within TE families, visualization of TEs, quantification of variation, estimates of TE abundance | Annotation of repeats, masked query sequence, genome proportion of repeat orders, divergence to consensus | TE contigs, genome proportion of TEs, abundance of contigs | TE contigs, genome proportions of TEs, estimates of relative age of TEs, abundance of contigs | TE contigs |
| Notable features | Divergence at nucleotide resolution, short and long reads, detects structural variants of TEs, container‐type installation, read preprocessing | Identify low complexity DNA, detect contamination in assembly, different search engines | Platform independent Galaxy server, read preprocessing, protein domain search, identification of novel repeats, suitable for low‐coverage sequencing | Identification of novel repeats, suitable for low‐coverage sequencing | Supports long‐reads, sensitive algorithm, suitable for low‐coverage sequencing, no TE library required |
| Shortcomings | No genomic position of TEs, no novel repeats | No quantification of families, no novel repeats, susceptible to low assembly quality | No genomic position of TEs, long runtimes | Installation requires RepBase subscription, no genomic position of TEs, no direct quantification of families | No quantification of families, does not consider sequencing quality, no genomic position of TEs |
| Availability (Win/Mac/Linux) | −/+/+ | −/+/+ | +/+/+ | −/−/+ | −/−/+ |