David Ellinghaus, Stefan Kurtz, Ute Willhoeft.
Abstract
BACKGROUND: Transposable elements are abundant in eukaryotic genomes, and it is believed that they have a significant impact on the evolution of gene and chromosome structure. While there are several completed eukaryotic genome projects, there are only a few high-quality genome-wide annotations of transposable elements. Therefore, there is a considerable demand for computational identification of transposable elements. LTR retrotransposons, an important subclass of transposable elements, are well suited for computational identification, as they contain long terminal repeats (LTRs).
Year: 2008 PMID: 18194517 PMCID: PMC2253517 DOI: 10.1186/1471-2105-9-18
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Structure of a typical LTR retrotransposon/retrovirus (adapted from [16]). LTR = long terminal repeat, TSD = target site duplication, PBS = primer binding site, PPT = polypurine tract, gag, pol, env = open reading frames for LTR retrotransposon genes.
Figure 2. Flowchart of the LTR retrotransposon prediction process in the program LTRharvest. The optional clustering process used in some benchmark tests is not shown in this flowchart.
Table 1. Parameter sets for LTRharvest used for predictions in the genomes of S. cerevisiae and D. melanogaster
| Parameter | default value | value S. cer. | value D. mel. | Description |
| seed | 30 | 100 | 76 | exact match length requirement for 5'-3' LTR pair |
| minlenltr | 100 | 100 | 116 | length constraint for 5'-3' LTR pair, in bp |
| maxlenltr | 1000 | 1000 | 800 | length constraint for 5'-3' LTR pair, in bp |
| mindistltr | 1000 | 1500 | 2280 | distance constraint for 5'-3' LTR pair, in bp |
| maxdistltr | 15000 | 15000 | 8773 | distance constraint for 5'-3' LTR pair, in bp |
| similar | 85% | 80.0% | 91.0% | similarity threshold of a 5'-3' LTR pair, in percent |
| xdrop | 5 | 5 | 7 | Xdrop score for extending seed |
| mat | 2 | 2 | 2 | match score |
| mis | -2 | -2 | -2 | mismatch score |
| ins | -3 | -3 | -3 | insertion score |
| del | -3 | -3 | -3 | deletion score |
| mintsd | 4 | 5 | 4 | minimal length of a target site duplication (TSD), in bp |
| maxtsd | 20 | 20 | 20 | maximal length of a target site duplication (TSD), in bp |
| motif | - | tg...ca | - | required motif |
| motifmis | 0 | 0 | 0 | maximum number of mismatches in motif |
| vic | 60 | 60 | 60 | number of nucleotide positions (vicinity) to the left and to the right, respectively, for searching TSD and motif around boundaries |
| overlaps | best | best | best | strategy for handling predicted LTR elements which overlap |
column 'value S. cer.' = setting for S. cerevisiae
column 'value D. mel.' = setting for D. melanogaster
For a detailed description of parameters, see the manual of LTRharvest [33].
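As a rough illustration of how the length, distance and similarity constraints in the parameter table act on a candidate LTR pair, the following Python sketch checks one candidate against each parameter set. The `Candidate` record, the function name and the convention that the distance is measured between the start positions of the two LTRs are assumptions made for illustration; they are not taken from the LTRharvest source code or manual.

```python
from dataclasses import dataclass

# Constraint values copied from the parameter table above.
PARAMETER_SETS = {
    "default": {"minlenltr": 100, "maxlenltr": 1000,
                "mindistltr": 1000, "maxdistltr": 15000, "similar": 85.0},
    "S. cer.": {"minlenltr": 100, "maxlenltr": 1000,
                "mindistltr": 1500, "maxdistltr": 15000, "similar": 80.0},
    "D. mel.": {"minlenltr": 116, "maxlenltr": 800,
                "mindistltr": 2280, "maxdistltr": 8773, "similar": 91.0},
}


@dataclass
class Candidate:
    """Hypothetical record for a candidate 5'/3' LTR pair (not an LTRharvest type)."""
    left_start: int     # start position of the 5' LTR
    left_length: int    # length of the 5' LTR, in bp
    right_start: int    # start position of the 3' LTR
    right_length: int   # length of the 3' LTR, in bp
    similarity: float   # similarity of the two LTRs, in percent


def passes_constraints(candidate, params):
    """Check the length, distance and similarity constraints for one candidate."""
    # Assumption: the LTR distance is measured between the two LTR start positions.
    distance = candidate.right_start - candidate.left_start
    lengths_ok = all(
        params["minlenltr"] <= length <= params["maxlenltr"]
        for length in (candidate.left_length, candidate.right_length)
    )
    distance_ok = params["mindistltr"] <= distance <= params["maxdistltr"]
    similarity_ok = candidate.similarity >= params["similar"]
    return lengths_ok and distance_ok and similarity_ok


# Made-up candidate: 300 bp LTRs whose starts are 5000 bp apart, 88% similar.
example = Candidate(left_start=10_000, left_length=300,
                    right_start=15_000, right_length=300, similarity=88.0)
results = {name: passes_constraints(example, p) for name, p in PARAMETER_SETS.items()}
print(results)
```

With these values the made-up candidate passes the default and S. cerevisiae settings but is rejected by the D. melanogaster settings, because its similarity is below the 91% threshold.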
Table 2. Quality validation of running LTRharvest on the D. melanogaster genome sequences (release 3)
| Chromosome | Predictions | References | TP | hTP | FP | FN |
| 2L | 91 | 66 | 64 | 1 | 26 | 1 |
| 2R | 89 | 54 | 49 | 3 | 37 | 2 |
| 3L | 96 | 59 | 51 | 7 | 38 | 1 |
| 3R | 96 | 67 | 60 | 6 | 30 | 1 |
| 4 | 11 | 4 | 4 | 0 | 7 | 0 |
| X | 122 | 54 | 51 | 3 | 68 | 0 |
The options used are listed in Table 1 in column 'value D. mel.'. Column 'Predictions' gives the number of predicted and column 'References' the number of annotated 'full length' LTR retrotransposons for each chromosome. A prediction is classified as a true positive (TP) if both its 5' and 3' boundary coordinates lie within 20 nucleotides of the corresponding coordinates of a reference element. If only one of the two predicted boundary coordinates lies within 20 nucleotides of the reference, the prediction is categorised as a half true positive (hTP). All other predictions are labelled false positives (FP). Annotated LTR retrotransposons that are missed by the predictions are labelled false negatives (FN).
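The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script: elements are represented as `(start, end)` tuples, the function names are hypothetical, and no one-to-one matching between predictions and references is enforced. The coordinates in the example are made up.

```python
def classify_prediction(prediction, references, tolerance=20):
    """Label one (start, end) prediction as 'TP', 'hTP' or 'FP'."""
    pred_start, pred_end = prediction
    label = "FP"
    for ref_start, ref_end in references:
        start_ok = abs(pred_start - ref_start) <= tolerance
        end_ok = abs(pred_end - ref_end) <= tolerance
        if start_ok and end_ok:
            return "TP"          # both boundaries within the tolerance
        if start_ok or end_ok:
            label = "hTP"        # only one boundary within the tolerance
    return label


def count_false_negatives(predictions, references, tolerance=20):
    """Count references that no prediction matches as TP or hTP."""
    missed = 0
    for ref in references:
        found = any(
            classify_prediction(pred, [ref], tolerance) != "FP"
            for pred in predictions
        )
        if not found:
            missed += 1
    return missed


# Made-up coordinates for illustration only.
references = [(1000, 6000), (20000, 27000), (50000, 55000)]
predictions = [(1005, 5990), (20010, 27100), (40000, 45000)]

labels = [classify_prediction(p, references) for p in predictions]
false_negatives = count_false_negatives(predictions, references)
print(labels, false_negatives)
```

In this example the first prediction matches both boundaries (TP), the second only its 5' boundary (hTP), the third matches nothing (FP), and the third reference is left without any matching prediction (one FN).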
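Table 3. List of clusters of LTRharvest predictions compared to the Drosophila melanogaster annotation
| Cluster | Predictions in cluster | Annotated family | Full-length annotated elements | All annotated elements |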
| LTRharvest_Dmel0 | 92 | Roo | 58 | 146 |
| LTRharvest_Dmel1 | 18 | opus | 16 | 24 |
| LTRharvest_Dmel2 | 15 | mdg1 | 13 | 25 |
| LTRharvest_Dmel3 | 2 | McClintock | 2 | 2 |
| LTRharvest_Dmel4 | 22 | blood | 22 | 22 |
| LTRharvest_Dmel5 | 26 | 412 | 24 | 31 |
| LTRharvest_Dmel6 | 50 | 297, 17.6 | 25 | 69 |
| LTRharvest_Dmel7 | 12 | Stalker, Stalker2, Stalker4 | 9 | 27 |
| LTRharvest_Dmel8 | 3 | rover | 3 | 6 |
| LTRharvest_Dmel9 | 15 | micropia, DM88, invader1 | 3 | 63 |
| LTRharvest_Dmel10 | 3 | invader3 | 3 | 16 |
| LTRharvest_Dmel11 | 19 | Tirant | 15 | 20 |
| LTRharvest_Dmel12 | 28 | copia | 26 | 30 |
| LTRharvest_Dmel13 | 9 | diver | 9 | 9 |
| LTRharvest_Dmel14 | 6 | Quasimodo | 5 | 14 |
| LTRharvest_Dmel15 | 5 | Transpac | 5 | 5 |
| LTRharvest_Dmel16 | 3 | Idefix | 2 | 7 |
| LTRharvest_Dmel17 | 12 | Burdock | 7 | 13 |
| LTRharvest_Dmel18 | 3 | | | |
| LTRharvest_Dmel19 | 16 | blastopia | 13 | 17 |
| LTRharvest_Dmel20 | 6 | springer | 5 | 11 |
| LTRharvest_Dmel21 | 12 | HMS-Beagle | 9 | 13 |
| LTRharvest_Dmel22 | 3 | | | |
| LTRharvest_Dmel23 | 5 | | | |
| LTRharvest_Dmel24 | 3 | GATE | 0 | 20 |
| LTRharvest_Dmel25 | 8 | mdg3 | 8 | 16 |
| LTRharvest_Dmel26 | 4 | invader2 | 3 | 10 |
| LTRharvest_Dmel27 | 2 | | | |
| LTRharvest_Dmel28 | 4 | 3S18 | 4 | 6 |
| LTRharvest_Dmel29 | 2 | gypsy5 | 1 | 2 |
| LTRharvest_Dmel30 | 3 | | | |
| LTRharvest_Dmel31 | 2 | | | |
| LTRharvest_Dmel32 | 2 | gypsy, gtwin | 3 | 8 |
| LTRharvest_Dmel33 | 2 | invader4 | 2 | 9 |
| LTRharvest_Dmel34 | 2 | Tabor | 2 | 3 |
| LTRharvest_Dmel35 | 2 | | | |
For clustering, the program Vmatch was used with the following options: seed length (minimal length of exact repeats) = 50, minimal length of matches = 2500, Xdrop = 9, and a matching condition (dbcluster) requiring that a match cover at least 80% of the smaller sequence and 28% of the larger sequence.
Of the 49 annotated families [26], this table lists 34 of the 41 families that contain at least one full-length member, and one family (GATE) of the 8 families that consist entirely of partial elements.
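The coverage condition used for clustering can be illustrated with a small Python check. This is a sketch of the stated thresholds only and does not reproduce Vmatch itself: the function name and its arguments are hypothetical, and applying the minimal match length to the shorter of the two matched regions is an assumption.

```python
def meets_cluster_condition(len_a, len_b, covered_a, covered_b,
                            min_match_length=2500,
                            min_cover_smaller=80.0, min_cover_larger=28.0):
    """Return True if a match between two sequences satisfies the thresholds.

    len_a, len_b         -- total lengths of the two sequences, in bp
    covered_a, covered_b -- number of bases of each sequence inside the match
    """
    # Assumption: the length requirement applies to the shorter matched region.
    if min(covered_a, covered_b) < min_match_length:
        return False
    # Decide which sequence is the smaller one and which is the larger one.
    if len_a <= len_b:
        small_len, small_cov, large_len, large_cov = len_a, covered_a, len_b, covered_b
    else:
        small_len, small_cov, large_len, large_cov = len_b, covered_b, len_a, covered_a
    cover_smaller = 100.0 * small_cov / small_len
    cover_larger = 100.0 * large_cov / large_len
    return cover_smaller >= min_cover_smaller and cover_larger >= min_cover_larger


# Made-up examples: 90% / 48.9% coverage passes, 90% / 22.5% does not.
print(meets_cluster_condition(5000, 9000, 4500, 4400))
print(meets_cluster_condition(5000, 20000, 4500, 4500))
```

The asymmetric thresholds let a shorter element join a cluster with a considerably longer one, as long as the shorter element is almost completely covered by the match.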
Table 4. Quality validation of programs for LTR retrotransposon prediction on the genome of S. cerevisiae
| Run-number | 1 | 2 | 3 | 4-1 | 5 | 6 | 7 | 8 | 9 |
| Parameter set | default* | default | default | default | default | see Tab. 1 | see Tab. 1 | see Tab. 1 | see Tab. 1 |
| Index file construction [s] | - | - | - | - | 8 | - | - | - | 8 |
| Run-time [s] | ~600 | 413 | 190 | 19 | 3 | 126 | 168 | 19 | 2 |
| Annotations | 50 | 50 | 47 | 50 | 50 | 50 | 46 | 50 | 50 |
| Predictions | 39 | 50 | 46 | 56 | 68 | 38 | 38 | 43 | 45 |
| Sensitivity | 76% | 80.0% | 89.4% | 100% | 98.0% | 74.0% | 69.6% | 84.0% | 90.0% |
| Specificity | 97.4% | 100.0% | 91.3% | 89.3% | 72.1% | 97.4% | 84.2% | 97.7% | 100% |
| Comment | | | program error chr03, 06 | | | | program error chr03, 06, 09 | | |
* = parameters are not adjustable.
Details on parameter settings and exact numbers of predictions are given in Table B of Additional file 1. Sensitivity and specificity were calculated as outlined in the S. cerevisiae benchmark, counting all TPs and hTPs as true positives. In runs no. 3 and no. 7, the program LTR_Rho reported an error for some chromosomal sequences, so the numbers of annotations and predictions were adjusted to the incomplete data set.
Table 5. Quality validation of programs for LTR retrotransposon prediction on the genome of D. melanogaster
| Run-number | 1 | 2 | 3 | 4-1 | 5-1 | 5-2 | 6 | 7 | 8 | 9 |
| Parameter set | default* | default | default | default | default | default + clustering | see Tab. 1 | see Tab. 1 | see Tab. 1 | see Tab. 1 + clustering |
| Index file construction [s] | - | - | - | - | 138 | 138 | - | - | - | 138 |
| Run-time [s] | 4380 | 24120 | 2286 | 1209 | 25 | 198** | 1380 | 1709 | 320 | 170** |
| Annotations (full length) | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 |
| Predictions | 310 | 188 | 417 | 395 | 723 | 490 | 160 | 398 | 204 | 411 |
| Sensitivity | 37.5% | 36.8% | 94.7% | 74.3% | 94.7% | 97.4% | 35.2% | 96.1% | 52.0% | 97.7% |
| Specificity | 36.8% | 59.6% | 69.1% | 57.2% | 40.4% | 60.4% | 66.9% | 73.4% | 77.5% | 72.3% |
| Annotations | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 682 |
| Predictions | 310 | 188 | 417 | 395 | 723 | 490 | 160 | 398 | 204 | 411 |
| Sensitivity | 20.1% | 22.0% | 54.0% | 45.9% | 58.9% | 57.9% | 19.8% | 53.5% | 28.7% | 56.3% |
| Specificity | 44.2% | 79.8% | 88.2% | 90.3% | 55.6% | 80.6% | 84.4% | 91.7% | 96.1% | 93.4% |
* = parameters are not adjustable.
** = run-time LTRharvest + clustering with Vmatch.
Details on parameter settings and exact numbers of predictions are given in Table C of Additional file 1. Sensitivity and specificity were calculated against the D. melanogaster annotation, counting all TPs and hTPs as true positives.
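The following Python sketch shows one way to compute the two quality measures from the counts. It assumes that sensitivity is the number of hits (TP + hTP) divided by the number of annotations and that specificity is the number of hits divided by the number of predictions. This reading is inferred from the reported numbers (for example, 45 predictions at 100% specificity and 90.0% sensitivity against 50 annotations in run 9 of the S. cerevisiae benchmark) and is not quoted from a definition in the text; under it, "specificity" corresponds to what is often called precision. The function name is hypothetical.

```python
def sensitivity_specificity(tp, htp, n_annotations, n_predictions):
    """Return (sensitivity, specificity) as fractions, counting hTPs as hits."""
    if n_annotations <= 0 or n_predictions <= 0:
        raise ValueError("annotation and prediction counts must be positive")
    hits = tp + htp
    sensitivity = hits / n_annotations   # fraction of annotated elements found
    specificity = hits / n_predictions   # fraction of predictions that are hits
    return sensitivity, specificity


# Counts for chromosome 2L from the per-chromosome validation table:
# 91 predictions, 66 references, 64 TP, 1 hTP.
sens, spec = sensitivity_specificity(tp=64, htp=1, n_annotations=66, n_predictions=91)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

For chromosome 2L this gives a sensitivity of 65/66 (about 98.5%) and a specificity of 65/91 (about 71.4%).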