Joseph D Valencia, Hani Z Girgis.
Abstract
BACKGROUND: Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and defense mechanisms. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None can be executed in parallel out of the box and very few have features to support visual review of new elements. To overcome these limitations, we developed LtrDetector, which uses techniques inspired by signal processing.
Keywords: Long terminal repeats retrotransposons; Repeats; Signal processing; Software
Year: 2019 PMID: 31159720 PMCID: PMC6547461 DOI: 10.1186/s12864-019-5796-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Method overview: LtrDetector is a software tool for locating long terminal repeat (LTR) retrotransposons (RTs). (a) A sequence of scores reflects the distance to the closest exact copy of the k-mer starting at each nucleotide. (b) Smoothed scores are produced after adjacent spikes are merged into a contiguous region. (c) Plateau regions are identified. Separate plateaus here are represented by black and red lines. (d) Plateaus are paired and their boundaries are adjusted. The red triangles denote the start and end coordinates for each LTR
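As a rough illustration of the scoring step in panel (a), the following Python sketch computes, for each position, the distance to the nearest other exact copy of the k-mer starting there, using a hash table keyed by k-mer. The function name `kmer_distance_scores`, the choice to look in both directions, and the use of 0 for a k-mer with no second copy are assumptions made for this example; they are not taken from the LtrDetector source code.

```python
from collections import defaultdict


def kmer_distance_scores(seq: str, k: int) -> list[int]:
    """Score each position by the distance to the nearest other exact copy
    of the k-mer that starts there.

    Assumptions (not from the paper): copies are searched in both
    directions, and a k-mer that occurs only once scores 0.
    """
    # Hash table: k-mer -> list of start positions, in increasing order.
    positions: dict[str, list[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    scores = [0] * len(seq)
    for starts in positions.values():
        if len(starts) < 2:
            continue  # unique k-mer: leave the score at 0
        for idx, pos in enumerate(starts):
            candidates = []
            if idx > 0:
                candidates.append(pos - starts[idx - 1])  # copy to the left
            if idx + 1 < len(starts):
                candidates.append(starts[idx + 1] - pos)  # copy to the right
            scores[pos] = min(candidates)
    return scores
```

For the toy sequence `ACGTGGCCAACGT` with `k = 4`, only the two copies of `ACGT` (at positions 0 and 9) receive a non-zero score, and both score 9.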
Fig. 2 (a) Contiguous stretches of the same non-zero score are identified and marked as keep (K) or delete (D). (b) The forward pass merges K sections toward each other and adjacent D sections. (c) The backward pass merges remaining D sections that are close to K sections
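The first step in panel (a), finding contiguous stretches of the same non-zero score, can be sketched with a run-length pass over the score array. The caption does not state the rule that decides between keep and delete, so the `min_run` length threshold below is purely hypothetical, as is the function name `label_runs`; the forward and backward merging passes are not attempted here.

```python
from itertools import groupby


def label_runs(scores, min_run=3):
    """Return (start, end, score, label) for each run of equal non-zero scores.

    `end` is exclusive. `min_run` is a hypothetical threshold used only for
    illustration: runs at least this long are labelled "K" (keep), shorter
    runs are labelled "D" (delete).
    """
    runs = []
    start = 0
    for score, group in groupby(scores):
        length = sum(1 for _ in group)
        if score != 0:
            label = "K" if length >= min_run else "D"
            runs.append((start, start + length, score, label))
        start += length
    return runs
```

With `scores = [0, 5, 5, 5, 0, 7, 5, 5]` and the default threshold, the run of three 5s is kept, while the single 7 and the trailing pair of 5s are marked for deletion.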
Length statistics to determine the default parameters of LtrDetector
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| LTR length | Maximum | 2033 | 5832 | 2886 | 5645 | | 5609 |
| | Minimum | 103 | 109 | 105 | 100 | | 154 |
| | Mean | 495 | 1128 | 421 | 731 | 679 | 1840 |
| | Standard deviation | 436 | 1399 | 356 | 942 | 936 | 1674 |
| Total length | Maximum | 14069 | 20595 | 18868 | | 20354 | 16260 |
| | Minimum | 824 | | 2565 | 764 | 536 | 5143 |
| | Mean | 5635 | 5982 | 5516 | 6697 | 6331 | 9449 |
| | Standard deviation | 2083 | 2995 | 1998 | 2979 | 2692 | 3274 |
These statistics were calculated on LTR-RTs of six plant genomes found in Repbase. In Repbase, an LTR-RT is reported as two sequences: the LTR sequence and the interior sequence. Default length parameters were chosen to approximate the most extreme values found in the dataset, which appear in boldface. To calculate the total length of an LTR-RT, we concatenated two LTR sequences to the two sides of its interior sequence
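The total-length calculation described above amounts to adding one LTR length on each side of the interior length. The sketch below shows that arithmetic together with the four summary statistics reported in the table. The function names and the example lengths are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def total_length(ltr_len: int, interior_len: int) -> int:
    """Length of a full element: one LTR on each side of the interior."""
    return 2 * ltr_len + interior_len


def length_summary(lengths):
    """Maximum, minimum, mean, and standard deviation of a list of lengths.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return {
        "max": max(lengths),
        "min": min(lengths),
        "mean": mean(lengths),
        "stdev": stdev(lengths),
    }


# Hypothetical (LTR length, interior length) pairs, not taken from Repbase.
records = [(500, 4000), (1000, 5000)]
totals = [total_length(ltr, interior) for ltr, interior in records]
summary = length_summary(totals)
```

Here the two made-up elements have total lengths of 5000 and 7000 nucleotides, so the summary reports a maximum of 7000, a minimum of 5000, and a mean of 6000.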
Fig. 3 The effect of different values of k (the size of the short words used as keys in the hash table) on the F1 measure. As k increases from 9 to 11 or 12, the F1 value increases (the higher, the better); beyond that, performance does not change markedly. (a) A. thaliana. (b) O. sativa
Results on the X Chromosome of D. melanogaster: We evaluated four de-novo tools on a ground-truth annotation provided by Lerat [1]
| Tool | Total | TP (of 96) | Sensitivity (%) | Memory (MB) | Time (sec) |
|---|---|---|---|---|---|
| LTR_Finder | 57 | 48 | 50.0 | 300.3 | 390 |
| LTR_seq | 204 | 48 | 50.0 | 874.4 | 7262 |
| LTRharvest | 200 | 93 | 96.9 | 190.5 | 23 |
| LtrDetector | 160 | 92 | 95.8 | 1209.3 | 15 |
Total is the number of proposed LTR-RTs, TP stands for true positives
Results on synthetic genomes: We constructed several synthetic chromosomes with randomly generated direct repeats mutated at a given percentage of nucleotides (0–30%) to assess performance at different levels of LTR conservation
| Tool | Total | TP | GT |
|---|---|---|---|
| | | | |
| LTRharvest | 90 | 90 | 90 |
| LtrDetector | 90 | 90 | 90 |
| | | | |
| LTRharvest | 92 | 91 | 92 |
| LtrDetector | 90 | 90 | 92 |
| | | | |
| LTRharvest | 78 | 74 | 91 |
| LtrDetector | 88 | 88 | 91 |
| | | | |
| LTRharvest | 32 | 29 | 92 |
| LtrDetector | 74 | 74 | 92 |
| | | | |
| LTRharvest | 10 | 8 | 93 |
| LtrDetector | 40 | 40 | 93 |
| | | | |
| LTRharvest | 1 | 0 | 90 |
| LtrDetector | 2 | 2 | 90 |
Total is the number of proposed LTR-RTs, TP is the number of true positives, and GT is the number of elements in the synthetic ground truth
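One way to build a synthetic element of the kind described for this experiment is to generate a random LTR, copy it, and mutate the copy at a chosen fraction of positions. The sketch below is a minimal version under stated assumptions: mutations are substitutions only, mutated positions are chosen without replacement, and the names `mutate` and `synthetic_element` are invented for this example. The authors' actual generator may differ.

```python
import random

BASES = "ACGT"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute a fraction `rate` of positions with a different base.

    Assumption: substitutions only, at distinct randomly chosen positions.
    """
    chars = list(seq)
    n_mut = round(rate * len(seq))
    for i in rng.sample(range(len(seq)), n_mut):
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)


def synthetic_element(ltr_len: int, interior_len: int, rate: float,
                      rng: random.Random) -> str:
    """Random LTR + random interior + mutated copy of the same LTR."""
    ltr = "".join(rng.choice(BASES) for _ in range(ltr_len))
    interior = "".join(rng.choice(BASES) for _ in range(interior_len))
    return ltr + interior + mutate(ltr, rate, rng)


rng = random.Random(0)  # fixed seed so the example is repeatable
element = synthetic_element(100, 300, 0.10, rng)
mismatches = sum(a != b for a, b in zip(element[:100], element[-100:]))
```

Because each mutated position is forced to change to a different base, the two LTR copies in this example differ at exactly 10 of 100 positions.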
Results on six plant genomes: We tested three tools on one model organism, A. thaliana, and five important crops of varying genomic size and repeat content
| Tool | Total | TP | GT | FP | Sensitivity | Precision | F1 | Time (hr:min:sec) | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | |
| LTR_Finder | 399 | 106 | 248 | 0 | 0.427 | 1.000 | 0.599 | 0:30:46 | 0.86 |
| LTRharvest | 2301 | 180 | 248 | 6 | 0.726 | 0.968 | 0.829 | 0:01:08 | 0.24 |
| LtrDetector | 1714 | 187 | 248 | 9 | 0.754 | 0.954 | 0.842 | 0:04:02 | 4.45 |
| | | | | | | | | | |
| LTR_Finder | 5324 | 1163 | 1760 | 14 | 0.661 | 0.988 | 0.792 | 5:19:03 | 0.95 |
| LTRharvest | 9761 | 1392 | 1760 | 182 | 0.791 | 0.884 | 0.835 | 0:03:09 | 0.34 |
| LtrDetector | 7343 | 1442 | 1760 | 119 | 0.819 | 0.924 | 0.868 | 0:15:30 | 5.31 |
| | | | | | | | | | |
| LTR_Finder | 11734 | 4219 | 6565 | 67 | 0.643 | 0.984 | 0.778 | 10:43:26 | 1.62 |
| LTRharvest | 22700 | 4476 | 6565 | 502 | 0.682 | 0.899 | 0.776 | 0:04:23 | 0.60 |
| LtrDetector | 24682 | 5285 | 6565 | 214 | 0.805 | 0.961 | 0.876 | 1:10:47 | 6.1 |
| | | | | | | | | | |
| LTR_Finder | 12141 | 1748 | 3130 | 7 | 0.558 | 0.996 | 0.716 | 25:18:21 | 1.88 |
| LTRharvest | 29016 | 2171 | 3130 | 20 | 0.694 | 0.991 | 0.816 | 0:09:35 | 0.48 |
| LtrDetector | 25537 | 2542 | 3130 | 12 | 0.812 | 0.995 | 0.894 | 0:43:06 | 6.11 |
| | | | | | | | | | |
| LTR_Finder | 60860 | 11411 | 16839 | 13 | 0.678 | 0.999 | 0.807 | 111:21:37 | 12.36 |
| LTRharvest | 101943 | 11244 | 16839 | 102 | 0.668 | 0.991 | 0.798 | 0:15:00 | 2.36 |
| LtrDetector | 116923 | 13122 | 16839 | 71 | 0.779 | 0.995 | 0.874 | 5:53:08 | 9.62 |
| H. vulgare | | | | | | | | | |
| LTR_Finder | – | – | – | – | – | – | – | – | – |
| LTRharvest | 207016 | 4378 | 9164 | 492 | 0.478 | 0.899 | 0.624 | 1:33:29 * | 5.12 |
| LtrDetector | 213367 | 6824 | 9164 | 199 | 0.745 | 0.972 | 0.843 | 17:24:04 ** | 14.15 |
| Total (Excluding H. vulgare) | | | | | | | | | |
| LTR_Finder | 90458 | 18647 | 28542 | 101 | 0.653 | 0.995 | 0.789 | 153:13:13 | – |
| LTRharvest | 165721 | 19463 | 28542 | 812 | 0.682 | 0.960 | 0.797 | 0:33:15 | – |
| LtrDetector | 176197 | 22578 | 28542 | 425 | 0.791 | 0.982 | 0.876 | 8:06:33 | – |
| Total (Including H. vulgare) | | | | | | | | | |
| LTRharvest | 372737 | 23841 | 37706 | 1304 | 0.632 | 0.948 | 0.759 | 02:06:44 | – |
| LtrDetector | 389564 | 29402 | 37706 | 624 | 0.780 | 0.979 | 0.868 | 25:30:37 | – |
Parameters used for each tool can be found in the “Implementation” section. Because neither LTR_Finder nor LTRharvest supports multi-threading, we used an additional utility to run each of them in parallel; this keeps the time comparison fair, since our tool, LtrDetector, is concurrent by default. Total is the number of proposed LTR-RTs, TP is the number of true positives, GT is the number of elements in the ground truth, and FP is the number of false positives. Sensitivity, Precision, and F1 are defined by Eqs. 1, 2, and 3. We report all measures for each genome and in total. Note: Results for LTR_Finder are unavailable for the Hordeum vulgare (barley) genome because memory demands repeatedly crashed the computer when running on four cores, and a subsequent trial on one core did not finish within two weeks of run time (2 of 7 chromosomes completed). All trials were run on four cores unless otherwise noted. * LTRharvest was run on one thread for H. vulgare. ** LtrDetector was run on three threads for H. vulgare
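Eqs. 1, 2, and 3 are not reproduced in this section. The sketch below uses the standard definitions of sensitivity, precision, and F1, which agree with the tabulated values when precision is computed as TP / (TP + FP) rather than TP / Total. That reading of the table, and the function name `evaluate`, are assumptions made for this example.

```python
def evaluate(tp: int, fp: int, gt: int) -> dict[str, float]:
    """Sensitivity, precision, and F1 from raw counts.

    tp: true positives, fp: false positives, gt: ground-truth elements.
    Assumes gt > 0 and tp + fp > 0.
    """
    sensitivity = tp / gt
    precision = tp / (tp + fp)
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


# LtrDetector row of the first genome in the table: TP = 187, FP = 9, GT = 248.
row = evaluate(tp=187, fp=9, gt=248)
rounded = {name: round(value, 3) for name, value in row.items()}
```

Rounded to three decimals, this gives a sensitivity of 0.754, a precision of 0.954, and an F1 of 0.842, matching the corresponding table row.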
Gene content validation: We searched for species-specific fused gag/pol in the interior of the known and the predicted LTR-RTs
| | | | | |
|---|---|---|---|---|
| Ground truth | 0.65 | 0.43 | 0.51 | 0.10 |
| LtrDetector | 0.48 | 0.31 | 0.25 | 0.10 |
| LTR_Finder | 0.49 | 0.28 | 0.20 | 0.11 |
| LTRharvest | 0.35 | 0.21 | 0.18 | 0.09 |