Cushla J Metcalfe, Didier Casane.
Abstract
BACKGROUND: Long interspersed nuclear elements (LINEs) are the most common transposable elements (TEs) in almost all metazoan genomes examined. In most LINE superfamilies there are two open reading frames (ORFs), and both are required for transposition. ORF2 is well characterized, while the structure and function of ORF1 are less well understood. ORF1s have been classified into five types based on structural organization and the domains identified. Here we perform a large-scale analysis of the ORF1 domains of 448 elements from the Jockey superfamily using multiple alignments and Hidden Markov Model (HMM)-HMM comparisons.
Keywords: Long interspersed nucleotide elements; Non-long terminal retrotransposon; Open reading frame 1; Plant homeodomain; RNA recognition motif
Year: 2014 PMID: 25093042 PMCID: PMC4120745 DOI: 10.1186/1759-8753-5-19
Source DB: PubMed Journal: Mob DNA
Figure 1. LINE superfamilies. Relationships between LINE superfamilies/groups and assignment of clades to superfamilies/groups based on reverse transcriptase (RT) phylogeny [7,9,10]. LINE clades were first assigned to five groups (R2, L1, RTE, I and Jockey) by Eickbush and Malik [8]. Groups are called superfamilies in the TE classification paper by Wicker et al. [7]. The ORF1 is not present in some RTE clades (shown with a dashed outline). In this paper, all Jockey superfamily/group full-length sequences from the Repbase database were assigned to three lineages based on an APE-RT phylogeny (see Figure 3). Subgroups were identified within the three lineages and ORF1 structures (see Figure 2) mapped onto the phylogeny (see Figures 4, 5 and 6).
Figure 3. Full-length Jockey superfamily elements fall into three lineages: CR1, L2 and Jockey. The neighbor-joining phylogeny is based on a concatenation of the ORF2 apurinic endonuclease (APE) and reverse transcriptase (RT) domains and inferred using MEGA 6 [20] with the Jones-Taylor-Thornton (JTT) substitution matrix. Robustness of the nodes was estimated by 500 bootstrap replications. Only bootstrap values for the three lineages are shown. Lineages are delineated by alternate light and dark grey shading using FigTree [21]. Details of the ORF1 types identified in the three lineages are shown in Figures 4, 5 and 6.
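The concatenation and bootstrap steps mentioned in the Figure 3 legend can be illustrated with a minimal sketch. It is not the MEGA 6 implementation: the function names, the toy sequences and the fixed random seed are all hypothetical, and the sketch only shows how one bootstrap replicate is formed by resampling alignment columns with replacement. Tree inference on each replicate is left out.

```python
import random

def concatenate(ape, rt):
    """Join aligned APE and RT sequences per element (hypothetical helper)."""
    return {name: ape[name] + rt[name] for name in ape if name in rt}

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement to form one replicate."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}

# Toy aligned amino acid sequences (invented for illustration only).
ape = {"elem1": "MKV-", "elem2": "MRVA", "elem3": "MKIA"}
rt = {"elem1": "LLDG", "elem2": "LIDG", "elem3": "LLEG"}

aln = concatenate(ape, rt)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(aln, rng) for _ in range(500)]
```

Each replicate would then be passed to the tree-building step, and the bootstrap value of a node is the percentage of replicate trees that recover it.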
Figure 2. ORF1 types identified in the Jockey, CR1 and L2 lineages. Subtypes (A, B, C) are used to show the diversity of ORF1 structures within types identified in this paper. Subtype titles within a circle denote those previously described by Khazina and Weichenrieder [11] and Kapitonov et al. [16]. Lineages and subgroups were identified by ORF1 structure and phylogenetic structuring based on the apurinic endonuclease (APE) and reverse transcriptase (RT) domains (see Figures 3, 4). Clades within lineages were identified by the RTclass1 tool [9]. The phylum and species are taken from the Repbase sequence title [17]. The ORF1 structure schematics show the coding domains 5’ to the endonuclease identified in this publication and are drawn to scale. Domains not always present are shown with a dashed outline. Red: CCHC, gag-like Cys2HisCys zinc-knuckle; green: CTD, C terminal domain; yellow: coiled-coil domain; purple: esterase; pink: PHD, plant homeodomain; blue: RRM, RNA recognition motif; lilac: zf/lz, zinc finger/leucine zipper. The hatched CC, RRM + CTD domains indicate transposase 22, the RCSB Protein Data Bank entry 2yko and Pfam entry PF02994. A key to all the domains is shown in Figure 6.
Figure 4. ORF1 types mapped onto the L2 lineage apurinic endonuclease (APE)-reverse transcriptase (RT) phylogeny. Please see the legend for Figure 6 for more details.
Figure 5. ORF1 types mapped onto the Jockey lineage apurinic endonuclease (APE)-reverse transcriptase (RT) phylogeny.
Figure 6. ORF1 types mapped onto the CR1 lineage apurinic endonuclease (APE)-reverse transcriptase (RT) phylogeny. The phylogenies are sub-trees of the Jockey superfamily APE-RT (ORF2) phylogeny (see Figure 3). Subgroups were identified based on phylogenetic clustering and the ORF1 type. These are delineated by alternate light and dark grey shading using FigTree [21] and numbered. Subgroup numbers are shown within white circles in the phylogeny and next to the ORF1 structure schematic. Subgroups were assigned to clades using the RTclass1 tool [9]. The ORF1 type/subtype (see Figure 2) and phylum in which elements were identified are shown above the schematic. The phylum is not shown if only a single sequence was identified. ORF1 domains were identified by multiple alignment of all sequences within the subgroup and an HMM-HMM comparison [22]. Coiled-coil domains were identified using Pcoils [23]. The ORF1 structure schematic shows domains 5’ to the endonuclease. The ORF1 structure is drawn to scale; domain lengths are the minimum identified. We have used the term ‘ORF1’ for simplicity’s sake, although in some cases domains are shown that are probably at the 5’ end of the ORF2 (L2 subgroup 5 and CR1 subgroup 4) or at the 5’ end of a single ORF (L2 subgroup 4 and CR1 subgroup 6). Domains are color-coded; details are shown in the key. CCHC, gag-like Cys2HisCys zinc-knuckle; CTD, C terminal domain; PHD, plant homeodomain; RRM, RNA recognition motif; zf/lz, zinc finger/leucine zipper. Transposase 22 refers to the RCSB Protein Data Bank entry 2yko_A and Pfam entry PF02994, the L1ORF1 protein composed of a coiled-coil, RRM and CTD domain [24]. Red asterisks indicate single sequences within a subgroup from a different phylum. In Figure 4 (L2 lineage) these are a single Branchiostoma floridae (Chordata) sequence in subgroup 6 and a single Capitella species (Annelida) sequence in subgroup 3. In Figure 5 (Jockey lineage) this is a single Drosophila sequence in subgroup 2.
Identification of ORF1 domains
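| Lineage and subgroup (a) | Sequences (n) | RT nucleotide identity, % (b) | ORF1 type (c) | Domain (d) | Minimum length, aa (e) | Amino acid identity, % (f) | Top hit (g) | Probability, % (h) | |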
|---|---|---|---|---|---|---|---|---|---|
| L2_1 | 13 | 82.2 | V | No hits | | | | | |
| L2_2 | 16 | 58.2 | IC | PHD | 50 | 30.0 | 3zpv_A | 98.2 | |
| | | | | RRM | 155 | 26.0 | 2ghp_A | 80.7 | 2 |
| | | | | CCHC | 67 | 46.1 | PTHR23002 | 98.5 | 3 |
| L2_3 | 43 | 57.6 | IIA | Tnp22 | 158 | 30.6 | 2yko_A | 100.0 | 1 |
| L2_4 | 7 | 64.8 | IIIA | PHD | 51 | 43.0 | 2vpb_A | 99.7 | |
| L2_5 | 8 | 52.2 | IIB | Tnp22 | 208 | 27.7 | 2yko_A | 100.0 | 1 |
| | | | | PHD | 55 | 29.1 | 3lqh_A | 99.7 | |
| L2_6 | 23 | 55.9 | IIC | PHD | 51 | 42.5 | 2vpb_A | 96.6 | |
| | | | | Tnp22 | 209 | 35.5 | 2yko_A | 100.0 | 1 |
| L2_7 | 7 | 50.3 | IIA | Tnp22 | 191 | 21.7 | 2yko_A | 100.0 | 1 |
| L2_8 | 4 | 62.7 | IC | RRM | 188 | 34.5 | 3smz_A | 86.7 | 2 |
| | | | | CCHC | 60 | 45.3 | PTHR23002 | 99.2 | 3 |
| L2_9 | 4 | 57.1 | IVA | Esterase | 176 | 28.0 | 3p94_A | 99.9 | |
| L2_10 | 2 | 79.0 | IA | RRM | 63 | 85.9 | 2lkz_A | 90.2 | |
| | | | | CCHC | 64 | 76.9 | PTHR23002 | 98.9 | |
| Jockey_1 | 75 | 51.8 | IB | RRM | 143 | 29.1 | 2cjk_A | 96.6 | 2 |
| | | | | CCHC | 55 | 46.2 | PTHR23002 | 98.9 | 3 |
| Jockey_2 | 12 | 51.6 | IA | RRM | 74 | 32.5 | 2lxi | 93.8 | 1 |
| | | | | CCHC | 54 | 39.4 | PTHR23002 | 98.8 | 3 |
| CR1_1 | 22 | 53.5 | IIA | Tnp22 | 186 | 23.5 | 2yko_A | 98.5 | 1 |
| CR1_2 | 11 | 70.0 | V | PHD | 50 | 41.3 | 2vpb_A | 94.6 | |
| CR1_3 | 8 | 58.8 | IC | PHD | 53 | 48.1 | 2vpb_A | 96.3 | |
| | | | | CC | 34 | 22.3 | | | |
| | | | | RRM | 144 | 37.1 | 1b7f_A | 93.8 | 2 |
| | | | | CCHC | 65 | 46.8 | PTHR23002 | 98.3 | 3 |
| CR1_4 | 112 | 53.1 | IIIB | PHD | 53 | 27.6 | 3zpv_A | 95.0 | |
| | | | | RRM | 53 | 28.4 | 2dhg_A | 79.4 | 1 |
| CR1_5 | 3 | 58.1 | IIC | PHD | 50 | 71.9 | 2vpb_A | 99.4 | |
| | | | | Tnp22 | 129 | 41.8 | 2yko_A | 63.1 | 1 |
| CR1_6 | 18 | 54.0 | IIIA | PHD | 48 | 40.6 | 1wep_A | 99.4 | |
| CR1_7 | 56 | 62.9 | IVB | lz | 44 | 34.0 | 2yon_A | 85.1 | |
| | | | | zf | 44 | 34.0 | 2gmg_A | 37.1 | |
| | | | | Esterase | 174 | 43.5 | 2waa_A | 99.7 | |
| CR1_8 | 4 | 61.5 | | Tnp22 | 175 | 36.5 | 2yko_A | 100.0 | |
(a) Lineage and subgroup identified by phylogenetic analysis based on a concatenation of the ORF2 apurinic endonuclease (APE) and reverse transcriptase (RT) domains. For further details please see the text.
(b) Average percent pairwise nucleotide identity of the RT domain for each subgroup, estimated using Geneious [25].
(c) ORF1 type (I-V) identified for each subgroup, based on ORF1 types described by Khazina and Weichenrieder [11]. Subtypes (A, B and C) are used to show the diversity of ORF1 structures within types identified in this paper.
(d) Domains identified within the ORF1 by HMM-HMM comparison [26] or by Pcoils [23]. CC, coiled-coil domain; CCHC, gag-like Cys2HisCys zinc-knuckle; lz, leucine zipper; PHD, plant homeodomain; RRM, RNA recognition motif; Tnp22, transposase 22 (RCSB Protein Data Bank entry 2yko_A and Pfam entry PF02994), which is the L1ORF1 protein composed of a coiled-coil, RRM and CTD domain [24]; zf, zinc finger.
(e) Minimum length of the inferred amino acid sequence for each domain.
(f) Average percent pairwise inferred amino acid identity for each domain, estimated using Geneious [25].
(g) Top hits starting with ‘PTHR’ are from the Panther Classification System; all other top hits are from the RCSB Protein Data Bank.
(h) Probability reported by HHPred [26].
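The "average percent pairwise identity" values in the table can be read as the mean of the percent identity over all sequence pairs in an alignment. The sketch below illustrates that arithmetic only. It is not the Geneious implementation: the function names and toy sequences are hypothetical, and skipping columns where either sequence has a gap is an assumption that may differ from how Geneious treats gaps.

```python
from itertools import combinations

def pairwise_identity(a, b, gap="-"):
    """Percent identity between two aligned sequences.

    Assumption: columns where either sequence has a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)

def average_pairwise_identity(seqs):
    """Mean percent identity over every unordered pair of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)

# Toy aligned nucleotide sequences (invented for illustration only).
toy = ["ACGT", "ACGA", "ACCA"]
avg = average_pairwise_identity(toy)
```

For the toy alignment the three pairs share 3, 2 and 3 of 4 positions, so the average is (75 + 50 + 75) / 3 percent.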
Figure 7. RNA recognition motif (RRM) domains fall into six clusters. All RRM domains were clustered using CLANS with Blastp and default values [28]. Where two RRM domains were identified, the 5’ domain is labeled ‘U’ for upstream, the 3’ domain ‘D’ for downstream. Single dots are single sequences and are color-coded by subgroup. ORF2 subgroup numbers are shown in circles. Dotted lines connecting sequences represent the confidence in the Blastp hit and are colored from dark to light grey; lightest is the lowest, darkest is the highest.
Figure 8. Neighbor-joining phylogeny based on ORF1 Type I domains. The ORF1s of CR1 group 3 sequences cluster with those of L2 group 2, suggesting that this type of ORF1 may have been horizontally acquired across lineages. The phylogeny was inferred using MEGA 6 [20] with the JTT substitution matrix. The robustness of the nodes was estimated by 1,000 bootstrap replicates. Only bootstrap values for major groups are shown.