Sascha Steinbiss, Sascha Kastens, Stefan Kurtz.
Abstract
BACKGROUND: Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current software allows for a comprehensive genome-wide de novo detection of such elements. The obvious next step is the classification of newly detected candidates resulting in (super-)families. Such a de novo classification approach based on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow.
Year: 2012 PMID: 23131050 PMCID: PMC3582472 DOI: 10.1186/1759-8753-3-18
Source DB: PubMed Journal: Mob DNA
Figure 1. Structure of a typical long terminal repeat retrotransposon. Adapted from [5]. AP: aspartic protease; IN: integrase; LTR: long terminal repeat; PPT: polypurine tract; RH: RNase H; RT: reverse transcriptase; TSD: target site duplication. The numbers below the illustration denote typical lengths of the respective components. This example is of the copia-like superfamily, as it shows the IN-RT-RH domain order.
Figure 2. Screenshot of the main window. (a) Putative family list, (b) candidate list, (c) candidate details, (d) candidate visualization. The currently loaded project contains 13,943 candidates from the Monodelphis domestica genome, with the currently selected candidate showing a full set of detected features (PBS, PPT, protein domains). ORF detection and reference matching have been performed. Additional details, such as PPT and PBS sequences, Pfam IDs and so on, are available by scrolling to the right in (c). The graphical representation in (d) depicts the retrotransposon (red) with PPT and PBS as small lines in the two tracks below. The next track shows protein domain matches, coded in different colors. Here integrase domains are depicted in blue, reverse transcriptase domains in red, protease domains in purple, RNase H domains in gold, and any other domains in green. The RNase H domain is marked in red because it has been selected in the candidate detail list. The reference match in the track below (shown in yellow) spans the interior region of the candidate completely, suggesting that it is likely a full-length element. The bottom track shows open reading frames in blue. LTR: long terminal repeat; ORF: open reading frame; PBS: primer binding site; PPT: polypurine tract; TSD: target site duplication.
Figure 3. Screenshot of the filter selection dialog. The left side of the dialog shows the filtering rules that have been added to the project and are available for use. The right side shows the filtering rules to be applied in the current filtering run. The checkbox next to each rule allows the user to negate it. This dialog is set to unclassify or delete all candidates not passing the filtering step; in this case, this means all candidates that contain no protein domains and no long reference matches. The buttons on the left allow adding rules to the project and removing them again. Moreover, rules can be edited directly from within LTRsift in a simple built-in text editor, avoiding the need to locate and open them in a separate text editor. Clicking the button on the lower left starts the filtering process.
Figure 4. Source code (in the Lua programming language) of the filtering rule for selecting/filtering candidates according to reference match coverage. The function computes the lengths of the candidate and of the reference matches contained in the candidate. If the length of at least one reference match exceeds 80% of the candidate length, the function returns false, otherwise true.
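The logic described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the Lua source shown in the figure and does not use the LTRsift rule interface. It is a plain Python restatement under stated assumptions: coordinates are taken to be 1-based inclusive (start, end) pairs, and the names `range_length` and `passes_reference_coverage_rule` are hypothetical.

```python
from typing import Iterable, Tuple

# Assumption: a feature is described by a 1-based, inclusive (start, end) pair.
Range = Tuple[int, int]


def range_length(r: Range) -> int:
    """Length of an inclusive coordinate range."""
    start, end = r
    return end - start + 1


def passes_reference_coverage_rule(
    candidate: Range,
    reference_matches: Iterable[Range],
    threshold: float = 0.8,
) -> bool:
    """Mirror the return values described in the Figure 4 caption.

    Returns False if at least one reference match is longer than
    `threshold` times the candidate length, and True otherwise.
    """
    candidate_length = range_length(candidate)
    for match in reference_matches:
        if range_length(match) > threshold * candidate_length:
            return False
    return True


if __name__ == "__main__":
    candidate = (1, 1000)  # hypothetical candidate of length 1000
    print(passes_reference_coverage_rule(candidate, [(101, 950)]))  # match of length 850
    print(passes_reference_coverage_rule(candidate, [(1, 700)]))    # match of length 700
```

Only the lengths are compared, as in the caption; the sketch does not check whether a match actually lies inside the candidate. The caption does not state how the two return values map onto keeping or discarding a candidate, so the sketch does not assume a mapping.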
Results for the use case
| Putative family | Known family | Candidates |
| --- | --- | --- |
| dmel_1 | mdg3 | 12 |
| dmel_3 | opus | 18 |
| dmel_4 | copia | 24 |
| dmel_5 | springer | 6 |
| dmel_6 | Burdock | 16 |
| dmel_7 | diver | 8 |
| dmel_8 | HMS-Beagle | 10 |
| dmel_9 | Tirant | 19 |
| dmel_10 | Tabor | 4 |
| dmel_11 | Quasimodo | 14 |
| dmel_12 | Transpac | 9 |
| dmel_14 | flea | 16 |
| dmel_17 | invader2 | 8 |
| dmel_17_2 | invader3 | 7 |
| dmel_18 | Max-element | 4 |
| dmel_22 | 3S18 | 6 |
| dmel_24 | McClintock | 4 |
| dmel_32 | 17.6 | 18 |
| dmel_33 | 412 | 17 |
| dmel_34 | 412 | 9 |
| dmel_36 | Idefix | 5 |
| dmel_39 | rover | 5 |
| dmel_46 | micropia | 3 |
| newfam_0 | blood | 25 |
| newfam_29 | HMS-Beagle2 | 4 |
| newfam_40 | 297 | 20 |
| manual | gypsy4 | 7 |
| manual | mdg1 | 17 |
| manual | roo | 94 |
This table lists the putative families as assigned during our semi-automatic evaluation run on the Drosophila melanogaster genome (left column). The center column shows the name of the known family represented by that putative family, obtained by matching the candidate sequences against a set of reference sequences. The rightmost column lists the number of candidates in the respective family. The dmel_26 group (matched to various Stalker sequences) was not counted as recovered because its candidates produced many non-unique matches to multiple references. Families with the newfam prefix were obtained by re-running the classification algorithm on subsets of the unclassified candidate set. Finally, families marked as manual were derived non-automatically.
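As a small illustration of how the table can be read, the following sketch groups the rows by how the putative family was derived (the dmel and newfam prefixes, or manual) and sums the candidate counts per group. The rows are copied from the table above. The grouping by the text before the first underscore and the function name `tally_by_origin` are assumptions made for this example, not part of LTRsift.

```python
from collections import defaultdict

# Rows copied from the results table: putative family | known family | candidates
TABLE = """
dmel_1 | mdg3 | 12
dmel_3 | opus | 18
dmel_4 | copia | 24
dmel_5 | springer | 6
dmel_6 | Burdock | 16
dmel_7 | diver | 8
dmel_8 | HMS-Beagle | 10
dmel_9 | Tirant | 19
dmel_10 | Tabor | 4
dmel_11 | Quasimodo | 14
dmel_12 | Transpac | 9
dmel_14 | flea | 16
dmel_17 | invader2 | 8
dmel_17_2 | invader3 | 7
dmel_18 | Max-element | 4
dmel_22 | 3S18 | 6
dmel_24 | McClintock | 4
dmel_32 | 17.6 | 18
dmel_33 | 412 | 17
dmel_34 | 412 | 9
dmel_36 | Idefix | 5
dmel_39 | rover | 5
dmel_46 | micropia | 3
newfam_0 | blood | 25
newfam_29 | HMS-Beagle2 | 4
newfam_40 | 297 | 20
manual | gypsy4 | 7
manual | mdg1 | 17
manual | roo | 94
"""


def tally_by_origin(table: str) -> dict:
    """Return {origin: (number of families, total candidates)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in table.strip().splitlines():
        family, _known, count = [cell.strip() for cell in line.split("|")]
        origin = family.split("_")[0]  # "dmel", "newfam" or "manual"
        totals[origin][0] += 1
        totals[origin][1] += int(count)
    return {origin: tuple(values) for origin, values in totals.items()}


if __name__ == "__main__":
    for origin, (families, candidates) in tally_by_origin(TABLE).items():
        print(f"{origin}: {families} families, {candidates} candidates")
```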