Pol Vendrell-Mir1, Fabio Barteri1, Miriam Merenciano2, Josefa González2, Josep M Casacuberta1, Raúl Castanera1.
Abstract
BACKGROUND: Transposable elements (TEs) are an important source of genomic variability in eukaryotic genomes. Their activity impacts genome architecture and gene expression and can lead to drastic phenotypic changes. Therefore, identifying TE polymorphisms is key to better understanding the link between genotype and phenotype. However, most genotype-to-phenotype analyses have concentrated on single nucleotide polymorphisms, as they are easier to reliably detect using short-read data. Many bioinformatic tools have been developed to identify transposon insertions from resequencing data using short reads. Nevertheless, the performance of most of these tools has been tested using simulated insertions, which do not accurately reproduce the complexity of natural insertions.
Keywords: Benchmark; Polymorphism; Resequencing; Transposable elements; Transposon insertion
Year: 2019 PMID: 31892957 PMCID: PMC6937713 DOI: 10.1186/s13100-019-0197-9
Source DB: PubMed Journal: Mob DNA
Tools selected for the benchmark of TE insertions
| Tool | Target | Prediction | Input | Output format | Perceived difficulty: Installation | Perceived difficulty: Input preparation | Manual |
|---|---|---|---|---|---|---|---|
| RelocaTE2 | Non-reference insertions | All families | fastq | gff file | Easy | Easy | |
| Jitterbug | Non-reference insertions | All families | Bam | gff file | Medium | Medium | |
| Retroseq a | Non-reference insertions | All families | Bam | vcf file | Easy | Difficult | |
| ITIS | Non-reference insertions | Single-family | fastq | Bed file | Easy | Medium | |
| MELT | Reference and non-reference insertions | Single-family | Bam | vcf file | Easy | Medium | |
| PoPoolationTE2 | Reference and non-reference insertions | All families | fastq | Tool-specific | Easy | Easy | |
| Teflon | Reference and non-reference insertions | All families | fastq | Tool-specific | Medium | Medium | |
| Trackposon | Reference and non-reference insertions | Single-family | fastq | Bed file | Easy | Easy | |
| TEMP a | Reference and non-reference insertions | All families | Bam | Tool-specific | Easy | Difficult | |
| TE-locate a | Reference and non-reference insertions | All families | Sam | Tool-specific | Easy | Difficult | |
| PoPoolationTE a | Reference and non-reference insertions | All families | fastq | Tool-specific | Easy | Difficult | |
| ngs_te_mapper a | Reference and non-reference insertions | All families | fastq | Bed file | Easy | Difficult | |
| McClintock | Reference and non-reference insertions | All families | fastq | Bed file | Easy | Difficult | |
a These tools were run as part of the McClintock pipeline. Perceived difficulty refers to McClintock and not the original methods
Installation: Easy = available in Conda, or automatic/semi-automatic installation. Medium = needs several dependencies or specific versions of packages that require manual installation. Input preparation: Easy = can be run using common formats (i.e. fasta, bed) without the need for specific formatting. Medium = needs specific formatting. Difficult = needs very specific formatting
Annotation of LTR-retrotransposons and MITEs in rice assemblies
| TE Classification | Nipponbare | MH63 |
|---|---|---|
| LTR-all a | 131,905 | 117,362 |
| LTR full-length b | 3733 | 3787 |
| LTR- Gypsy | 1354 | 1303 |
| LTR- Copia | 944 | 759 |
| LTR- Unclassified c | 1435 | 1725 |
| MITE-all a | 211,732 | 191,113 |
| MITE full-length d | 45,963 | 46,725 |
a RepeatMasker fragments. Includes both intact and truncated elements
b High confidence elements containing intact LTRs, TSDs and coding domains
c Intact elements whose poor coding domain conservation doesn’t allow proper classification
d Elements spanning more than 80% of their family consensus length
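The full-length criterion in footnote d can be written as a one-line check. The sketch below is only an illustration of that threshold: the function name, its parameters, and the use of integer percentages are assumptions made here, not the authors' annotation code.

```python
def is_full_length(element_length: int, consensus_length: int,
                   min_percent: int = 80) -> bool:
    """Return True if an element spans more than `min_percent` of its
    family consensus length (footnote d uses 80%).

    Integer arithmetic is used so the boundary case is exact.
    Both lengths are in base pairs; names are hypothetical.
    """
    if consensus_length <= 0:
        raise ValueError("consensus_length must be positive")
    return 100 * element_length > min_percent * consensus_length


if __name__ == "__main__":
    # A 250 bp copy of a family with a 300 bp consensus spans ~83%.
    print(is_full_length(250, 300))
```

Note that the comparison is strict ("more than 80%"), so an element covering exactly 80% of the consensus is not counted as full-length under this reading.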
Fig. 1 Density of MITEs (a) and LTR-retrotransposons (b) along rice chromosome 5 (window size = 50 kb). Black circles represent centromeres. Track 1 shows the density of all elements annotated in the chromosome by RepeatMasker. Track 2 shows the density of full-length elements. Track 3 shows the density of validated non-reference insertions (MH63-specific insertions) in the benchmarking standard. Tracks 4–8 show the density of non-reference predictions of five tools
Fig. 2 Individual validation of predicted insertions. Black boxes represent TE annotations in the Nipponbare IRGSP (green rectangle) and MH63 (blue rectangle) assembled genomes. Examples of shared (reference) and MH63-specific (non-reference) insertions are shown in a. Insertions predicted by each tool (shown as arrows in b) were intersected with windows of 500 bp spanning the entire Nipponbare IRGSP genome, and windows having an intersection (red boxes, b) were aligned to the MH63 genome. True positive reference insertions (TP ref.) were those having full-length alignments with an MH63 region where a MITE or LTR-retrotransposon was annotated. False positives (FP) had high-quality alignments (MAQ > 30) to regions where no MITE or LTR-retrotransposon was present. True positive non-reference insertions (TP non-ref) were those having a spliced alignment in which the two hits were separated by a region that overlaps with a MITE or LTR-retrotransposon annotated in MH63
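The first step of the validation described for Fig. 2, intersecting predictions with 500 bp windows tiled along the reference genome, can be sketched in plain Python. This is a minimal illustration under stated assumptions: windows are non-overlapping and start at position 0, coordinates are 0-based half-open, and the function name and tuple layout are hypothetical. The tooling the authors actually used for this step is not stated in this summary.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end), 0-based half-open; an assumed layout.
Interval = Tuple[str, int, int]


def windows_hit(predictions: Iterable[Interval],
                window_size: int = 500) -> List[Interval]:
    """Return the sorted, de-duplicated fixed-size windows that
    intersect at least one predicted insertion interval."""
    hit = set()
    for chrom, start, end in predictions:
        if end <= start:
            end = start + 1  # treat a zero-length prediction as one base
        first = start // window_size
        last = (end - 1) // window_size
        for idx in range(first, last + 1):
            hit.add((chrom, idx * window_size, (idx + 1) * window_size))
    return sorted(hit)


if __name__ == "__main__":
    preds = [("Chr5", 480, 520), ("Chr5", 1200, 1201)]
    for window in windows_hit(preds):
        print(window)
```

A prediction that straddles a window boundary (such as 480–520 here) marks both neighbouring windows, so each of them would be passed on to the alignment step against MH63.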
Fig. 3 Performance of broad-spectrum tools in the detection of reference insertions of MITEs (a), all LTR-retrotransposons (b) and full-length LTR-retrotransposons (c)
Fig. 4 Performance of broad-spectrum tools in the detection of non-reference insertions of MITEs (a) and LTR-retrotransposons (b). RelocaTE2 on LTR-retrotransposons at 40X was killed after running for 5 days with 8 CPUs and 64 GB of RAM
Fig. 5 Performance of family-specific tools in the detection of non-reference insertions of MITEs (a) and LTR-retrotransposons (b). Trackposon was run on 10 kb windows for LTR-retrotransposons as described in [7]
Fig. 6 Venn diagrams representing the detection overlap in non-reference true positives and false positives for MITEs and LTR-retrotransposons
Fig. 7 Performance of tool combinations in the detection of non-reference insertions in MITEs (a) and LTR-retrotransposons (b)
Fig. 8 Performance comparison between the McClintock pipeline and our proposed tool combinations for MITEs (a) and LTR-retrotransposons (b). PoPoolationTE2 and Teflon are filtered by zygosity as explained in the text (cutoffs of 0.7 and 1, respectively)
Number of insertions detected by PoPoolationTE2, Jitterbug and Teflon using a validated Drosophila melanogaster dataset
| | RAL-737 | RAL-40 | RAL-801 | RAL-802 | RAL-850 | RAL-502 | RAL-508 | RAL-491 | RAL-235 | RAL-21 | TOTAL | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Validated insertions | 17 | 16 | 9 | 7 | 4 | 5 | 7 | 5 | 4 | 7 | 81 | |
| PoPoolationTE2 | 12 | 5 | 9 | 5 | 3 | 3 | 6 | 5 | 3 | 5 | 56 | 69.1 |
| Jitterbug | 11 | 2 | 3 | 5 | 2 | 2 | 4 | 2 | 3 | 2 | 36 | 44.4 |
| Teflon | 12 | 6 | 9 | 4 | 3 | 4 | 5 | 4 | 4 | 5 | 56 | 69.1 |
| Combination | 15 | 6 | 9 | 7 | 3 | 4 | 7 | 5 | 4 | 6 | 66 | 81.5 |
Total number of insertions detected by each tool on each line is provided in Additional file 5: Table S4
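The % column in the Drosophila table is the share of the 81 validated insertions that each tool recovered. As a quick check, the sketch below recomputes it from the TOTAL column; the variable and function names are chosen here for illustration only.

```python
# Totals copied from the TOTAL column of the table above.
VALIDATED_TOTAL = 81
DETECTED_TOTALS = {
    "PoPoolationTE2": 56,
    "Jitterbug": 36,
    "Teflon": 56,
    "Combination": 66,
}


def sensitivity_percent(detected: int, validated: int) -> float:
    """Percentage of validated insertions that were detected,
    rounded to one decimal place."""
    if validated <= 0:
        raise ValueError("validated must be positive")
    return round(100 * detected / validated, 1)


if __name__ == "__main__":
    for tool, detected in DETECTED_TOTALS.items():
        print(f"{tool}: {sensitivity_percent(detected, VALIDATED_TOTAL)}%")
```

Because only validated insertions are counted, this measures sensitivity alone; it says nothing about how many of each tool's other predictions are false positives.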
Number of insertions detected by Jitterbug, MELT and PoPoolationTE2 using a validated human dataset
| Tool | Homozygous | Heterozygous | Total |
|---|---|---|---|
| Validated insertions | 46 | 102 | 148 |
| PoPoolationTE2 | 44 (95.7%) | 90 (88.2%) | 134 (90.5%) |
| Jitterbug | 22 (47.8%) | 52 (51.0%) | 74 (50.0%) |
| Teflon a | – | – | – |
| MELT | 45 (97.8%) | 84 (82.4%) | 129 (87.2%) |
| Combination | 45 (97.8%) | 94 (92.2%) | 139 (93.9%) |
a Teflon was killed after running for 5 days with 12 CPUs and 300 GB of RAM
Total number of insertions detected: PoPoolationTE2 (ref and non-ref) = 186,038; Jitterbug (non-ref) = 624; MELT (non-ref) = 1297
Fig. 9 Running time of each tool to perform the detection of MITEs in a 10X dataset. Family-specific tools are marked with an asterisk. All tools were run using 8 CPUs and 64 GB of RAM