Arthur Gilly, Mathilde Etcheverry, Mohammed-Amin Madoui, Julie Guy, Leandro Quadrana, Adriana Alberti, Antoine Martin, Tony Heitkam, Stefan Engelen, Karine Labadie, Jeremie Le Pen, Patrick Wincker, Vincent Colot, Jean-Marc Aury.
Abstract
BACKGROUND: Transposable elements (TEs) are DNA sequences that can move within the genome by cutting or copying themselves to another locus. As such, they are increasingly recognized as impacting all aspects of genome function. With the dramatic reduction in the cost of DNA sequencing, it is now possible to resequence whole genomes in order to systematically characterize novel TE mobilization in a particular individual. However, this task is made difficult by the inherently repetitive nature of TE sequences, which in some eukaryotes make up over half of the genome sequence. Currently, only a few software tools dedicated to detecting TE mobilization from next-generation sequencing data are described in the literature. They often target specific TEs for which annotation is available, and can only identify families of closely related TEs rather than individual elements.
Year: 2014 PMID: 25408240 PMCID: PMC4279814 DOI: 10.1186/s12859-014-0377-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. TE-Tracker overview: main steps of the TE-Tracker pipeline.
Figure 2. TE-Tracker main algorithms. a. Discordant pairs around the insertion breakpoint. Sequenced reads around a newly inserted TE copy (top half) produce discordant read mappings when aligned onto the reference sequence, where the newly inserted copy exists only at the locus of origin (bottom half). The thin black line represents the sequenced DNA fragment, and the thick black line represents a transposon of interest. Yellow and orange arrows represent the left and right extremities of the insertion breakpoint; linked arrows represent paired-end reads. Grey reads will be mapped normally, while colored reads will be mapped discordantly; the color indicates the type of discordance (left mate on the acceptor and right mate on the donor, or vice versa). b. Clustering of discordant pairs. Discordant reads of the same type are isolated and sorted (left half). Both ends must be sufficiently close for two read pairs to be clustered together, but sorting on the left end, combined with a random insert size, results in different thresholds for clustering the two ends. Pairs are clustered according to the Single-Linkage method (see Methods), which represents read pairs as edges on a graph (right half). A point is added to a cluster if its distance to any other point already in the graph meets both thresholds when projected on both axes. c. Cluster merging. Local drops in read coverage break clusters, corrupting insertion signals. A proximity threshold is applied to merge neighboring clusters of the same type and orientation. Local coverage is represented by a grey curve on top of the sequence, while linked colored arrows represent clusters of read pairs. d. Calling. The four types of transposition events detected by TE-Tracker along with their associated cluster signatures, with an emphasis on the overlap condition used to assemble clusters with compatible signatures into bona fide events.
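The clustering step described in panel b can be sketched as follows. This is a minimal illustration and not TE-Tracker's actual code: the function name `single_linkage_clusters`, the representation of a read pair as a `(left, right)` coordinate tuple, and the two threshold values are assumptions made for the example. A pair joins a cluster when it lies within both per-axis thresholds of at least one pair already in that cluster, which is the single-linkage criterion.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (left-mate position, right-mate position); hypothetical representation


def single_linkage_clusters(pairs: List[Pair], max_left_gap: int,
                            max_right_gap: int) -> List[List[Pair]]:
    """Group read pairs so that every pair is within both thresholds of
    at least one other pair in its cluster (single linkage)."""
    parent = list(range(len(pairs)))  # union-find forest over pair indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Sort on the left end so the inner scan can stop early.
    order = sorted(range(len(pairs)), key=lambda i: pairs[i][0])
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if pairs[j][0] - pairs[i][0] > max_left_gap:
                break  # later pairs are even further away on the left axis
            if abs(pairs[j][1] - pairs[i][1]) <= max_right_gap:
                parent[find(i)] = find(j)  # link the two clusters

    clusters: Dict[int, List[Pair]] = {}
    for i in order:
        clusters.setdefault(find(i), []).append(pairs[i])
    return list(clusters.values())


# Hypothetical coordinates and thresholds, for illustration only.
example_pairs = [(100, 5000), (150, 5040), (180, 5100), (900, 5050), (160, 9000)]
example_clusters = single_linkage_clusters(example_pairs, max_left_gap=100, max_right_gap=100)
for cluster in example_clusters:
    print(cluster)
```

Using two separate thresholds mirrors the caption's remark that the left and right ends need different clustering limits. Note that `(160, 9000)` stays alone even though its left end is close to the first cluster, because its right end fails the second threshold.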
Figure 3. Illustration of the donor-scoring algorithm. In this example we describe an event involving a TE copy that differs by only one base pair from another TE in the same family. Because multiple mappings are considered, most of the discordant reads anchored around the insertion locus will map on both candidate donors equally well (plain blue and plain red reads), which will result in TE-Tracker reporting both of them. However, a fraction of the discordant reads (blue reads with a red mark) will span the one divergent position that differentiates the two copies. These reads will map on both locations as well, but their mapping quality score will be significantly higher on the true donor copy. Counting such reads for each donor allows TE-Tracker to quickly determine a "specificity score" for each candidate, thereby helping to determine the probable true origin of the transposition event. For simplicity, only the multiple mappings of discordant pairs are represented in this figure.
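The counting idea behind the specificity score can be sketched as below. This is an illustrative reading of the caption, not the tool's implementation: the function name `specificity_scores`, the per-read dictionary of mapping qualities, the donor labels `TE_A` and `TE_B`, and the `min_margin` parameter (standing in for "significantly higher") are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def specificity_scores(read_mapqs: Iterable[Dict[str, int]], min_margin: int = 1) -> Counter:
    """Count, for each candidate donor, the reads whose mapping quality there
    exceeds the quality on every other candidate by at least `min_margin`."""
    scores: Counter = Counter()
    for mapqs in read_mapqs:
        if not mapqs:
            continue  # read has no candidate mapping; nothing to count
        ranked = sorted(mapqs.items(), key=lambda item: item[1], reverse=True)
        best_donor, best_quality = ranked[0]
        # A read is informative if it maps to one candidate only, or
        # clearly better on one candidate than on the runner-up.
        if len(ranked) == 1 or best_quality - ranked[1][1] >= min_margin:
            scores[best_donor] += 1
    return scores


# Hypothetical mapping qualities for four discordant reads and two candidate donors.
example_reads = [
    {"TE_A": 30, "TE_B": 30},  # maps equally well: uninformative
    {"TE_A": 40, "TE_B": 12},  # spans the divergent position
    {"TE_A": 37, "TE_B": 20},  # spans the divergent position
    {"TE_A": 25, "TE_B": 25},  # uninformative
]
example_scores = specificity_scores(example_reads, min_margin=10)
print(example_scores.most_common())
```

Reads that map equally well on every candidate contribute nothing, so the candidate with the highest count is the one supported by reads covering divergent positions.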
Comparison of the features, algorithms and input formats of common software used to detect mobilization of TE and/or structural variations
| Software | Type | Input | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RetroSeq | TE-dedicated | BAM file, TE annotation or sequence | ✓ | ✓ | ✗ | ✗ | 100 bp-1 Kbp | ✓ |
| Tea | TE-dedicated | BAM file, TE annotation or sequence | ✗ | ✓ | ✗ | ✗ | 1 bp | ✓ |
| T-lex | TE-dedicated | FASTQ file, TE annotation | ✗ | ✓ | ✗ | ✗ | 1 bp | ✓ |
| Popoolation-TE | TE-dedicated | FASTQ file, TE sequence and TSD annotation | ✗ | ✓ | ✗ | ✗ | 1 bp | ✓ |
| TE-locate | TE-dedicated | FASTQ file, TE sequence | ✗ | ✓ | ✗ | ✗ | 1 bp | ✓ |
| ngs_te_mapper | TE-dedicated | FASTQ file, TE annotation | ✗ | ✓ | ✓ | ✗ | 1 bp | ✓ |
| RelocatTE | TE-dedicated | FASTQ file, TE sequence | ✗ | ✓ | ✓ | ✗ | 1 bp | ✓ |
| TIF | TE-dedicated | FASTQ file, TE sequence and TSD annotation | ✗ | ✓ | ✗ | ✗ | 1 bp | ✓ |
| VariationHunter | SV | DIVET alignment file (mrFAST output) | ✗ | ✗ | ✓ | ✓ | 100 bp-1 Kbp | ✓ |
| PRISM | SV | BAM file | ✓ | ✗ | ✗ | ✓ | 1 bp | ✗ |
| Delly | SV | BAM file | ✓ | ✗ | ✗ | ✓ | 1 bp | ✗ |
| GASVpro | SV | Alignment file and coverage data file | ✓ | ✗ | ✓ | ✓ | 100 bp-1 Kbp | ✗ |
| Hydra | SV | Discordant reads coordinates and mapping features | ✗ | ✗ | ✓ | ✓ | 1 bp | ✗ |
Software performance evaluated using simulated transposition events in the Arabidopsis genome
| | | | | | | |
|---|---|---|---|---|---|---|
| Input data§ | PE reads | MP reads | PE reads | MP reads | PE reads | PE reads |
| Filter | None | >10 supporting pairs | >2 supporting pairs | >10 supporting pairs | >2 supporting pairs | >10 supporting pairs |
| Filtered predictions | 190 | 351 | 795 | 10,366 | 6,448 | 26 |
| FP | 20 | 82 | 564 | 10,017 | 6,358 | 20 |
| # Insertion found† | 146 | 260 | | 139 | 247 | 6 |
| # Insertion† + correct donor found | 128 | | 214 | 139 | 225 | 0 |
| Positive predictive value (PPV) | 67.3% | | 26.9% | 1.3% | 3.5% | 0% |
| Sensitivity | 42.6% | | 71.3% | 46.3% | 75% | 0% |
[ † ] Insertion found at +/− 300 bp.
[§] Paired-end (PE) reads were generated using ART and mate-pair (MP) reads were generated using SimSeqG. If programs can deal with both types of input data, we chose to report only the results obtained from the sequencing protocol that led to the best metrics.
A transposition event is counted as "found" when at least one line in the output file has either side of a cluster overlapping the insertion site (for TE-Tracker, only the acceptor site is considered). It is counted as "found with donor" when at least one line in the output file spans both the origin and destination sequences (for TE-Tracker, the acceptor/donor nature of each site is taken into account). Even when the correct donor is identified for an insertion locus, other possible donors are often reported because of sequence similarity. For TE-Tracker, the number of cases where the donor-scoring feature distinguishes the real donor from all reported ones is shown in parentheses; this feature is unique to TE-Tracker. The best detection statistic is displayed in bold in relevant rows.
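The PPV and sensitivity rows can be checked with a short calculation. The sketch below assumes that PPV is the number of insertions with the correct donor divided by the number of filtered predictions, and that sensitivity is the same count divided by the number of simulated events. The total of 300 simulated events is an assumption inferred from the reported percentages (for example, 128 correspond to 42.6%); it is not stated in this section. Under these assumptions the computed values agree with the table to within 0.1 percentage point.

```python
SIMULATED_EVENTS = 300  # assumption: inferred from the reported percentages, not stated here


def ppv(correct_donor: int, filtered_predictions: int) -> float:
    """Share of filtered predictions that are insertions with the correct donor."""
    return correct_donor / filtered_predictions if filtered_predictions else 0.0


def sensitivity(correct_donor: int, simulated: int = SIMULATED_EVENTS) -> float:
    """Share of simulated events recovered with the correct donor."""
    return correct_donor / simulated


# (filtered predictions, insertions with correct donor) for the table columns
# in which both counts are listed.
columns = [(190, 128), (795, 214), (10366, 139), (6448, 225), (26, 0)]

for predictions, correct in columns:
    print(f"predictions={predictions:>6}  "
          f"PPV={ppv(correct, predictions):.1%}  "
          f"sensitivity={sensitivity(correct):.1%}")
```

The contrast between the columns shows why both metrics are reported together: a loose filter can reach high sensitivity while PPV drops to a few percent because of the large number of false positives.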
TE-dedicated software evaluation
| Software | | | | | |
|---|---|---|---|---|---|
| RetroSeq | 128 (43%) | 87 (87%) | 0 (0%) | 0 (0%) | 41 (82%) |
| TE-Tracker | 257 (86%) | 91 (91%) | 81 (81%) | 42 (84%) | 43 (86%) |
[ † ] Insertion found at +/− 300 bp.
Sequencing and alignment properties
| Sample | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 439 | 127,172,830 | 27.6 | 3.6 | 1.5 | 1.6 | 9.0 | 53.7 | 22.3 | 4,900 |
| MEJ07 | 92,937,978 | 35.2 | 5.3 | 1.4 | 1.4 | 11.8 | 41.9 | 20.8 | 4,900 |
| 60 | 85,525,387 | 36.0 | 5.2 | 1.4 | 1.4 | 11.2 | 41.5 | 19.5 | 5,200 |
| 454 | 92,352,477 | 19.0 | 3.1 | 1.1 | 1.1 | 9.0 | 63.8 | 11.2 | 5,300 |
| 55 | 71,487,300 | 35.9 | 8.0 | 1.8 | 1.8 | 9.4 | 39.9 | 16.3 | 5,300 |
Figure 4. Circos representation of new TE insertion events detected in four epiRILs. The outer circle represents the five chromosomes of Arabidopsis, with pericentromeric regions and the heterochromatic knob on chromosome 4 in dark grey. Arrows link donor TEs with the new insertion sites. Only events mapped with no ambiguity (no multiple acceptor sites and no similarity with events detected in wt) are represented.
Figure 5. GBrowse view of composite elements detected by TE-Tracker. Red dotted lines indicate the boundaries of the mobile sequence as detected by TE-Tracker. a. Element composed of two sequences annotated as ATENSPM3 and one annotated as HELITRON2. b. Element composed of two sequences annotated as ATENSPM3 and one annotated as ATLANTYS1.