Mining hidden polymorphic sequence motifs from divergent plant helitrons.

Abstract

As a major driving force of genome evolution, transposons have been deviating from their original connotation as "junk" DNA ever since their important roles were revealed. The recently discovered Helitron transposons have been investigated in diverse eukaryotic genomes because of their remarkable gene-capture ability and other features that are crucial to our current understanding of genome dynamics. Helitrons are not canonical transposons in that they do not end in inverted repeats or create target site duplications, which makes them difficult to identify. Previous methods mainly rely on sequence alignment of conserved Helitron termini or manual curation. The abundance of Helitrons in genomes is still underestimated. We developed an automated and generalized tool, HelitronScanner, that identified a plethora of divergent Helitrons in many plant genomes. A local combinational variable approach as the key component of HelitronScanner offers a more granular representation of conserved nucleotide combinations and therefore is more sensitive in finding divergent Helitrons. This commentary provides an in-depth view of the local combinational variable approach and its association with Helitron sequence patterns. Analysis of Helitron terminal sequences shows that the local combinational variable approach is an efficacious representation of nucleotide patterns imperceptible at a full-sequence level.

Keywords: Helitron; algorithm; bioinformatic analysis; local combinational variable; sequence pattern

Year: 2014 PMID: 26442169 PMCID: PMC4588551 DOI: 10.4161/21592543.2014.971635

Source DB: PubMed Journal: Mob Genet Elements ISSN: 2159-2543

Abbreviations: LCV, local combinational variable; PSSM, position-specific scoring matrix

Transposable elements jump around and reshape genomes through the action of transposases either encoded by themselves or other transposons from the same family. Transposons in one family share the transposase and transposition mechanism and homologous terminal/subterminal sequences. As a special kind of transposon, Helitrons have been widely studied in a broad range of eukaryotes because of their remarkable ability to capture genes and regulatory elements. Helitrons presumably transpose by a rolling-circle mechanism because putative autonomous Helitrons encode proteins containing 3 conserved functional motifs that are known to be involved in bacterial and phage rolling-circle replication. However, unlike other DNA transposons, Helitrons do not possess terminal inverted repeats or create target site duplications, which likely delayed their discovery and hindered subsequent large-scale automated annotation. Even though Helitrons are reported broadly in diverse genomes, the number of Helitrons is probably still underestimated due to their lack of canonical transposon structures. Methods to identify Helitrons are based on homology of a RepHel protein and terminal sequences. Helitrons are deemed putatively autonomous if they encode intact RepHel proteins, and non-autonomous if they lack the transposase. Autonomous ones are scarce, so automated Helitron identification tools mainly focus on homology of short sequences at the Helitron termini. Our previous HelitronFinder tool looks for Helitron hallmarks, including AT dinucleotide insertion site, 5′-TC, CTAG-3′, and a conserved 16- to 20-bp palindromic structure located 10–15 bp away from the 3′ termini. HelSearch, another structure-based tool, first detects 3′ hairpins, retains those with multiple copies in the host genome, and manually extends toward 5′ ends to determine Helitron 5′ boundaries. 
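The structural hallmarks listed above can be checked with a few lines of code. The following is a minimal sketch only, not the HelitronFinder or HelSearch implementation: the function names, the perfect-stem requirement, and the default stem, loop, and window sizes are all assumptions chosen for illustration. It also assumes that an AT insertion site means the element is preceded by A and followed by T in the host sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def has_terminal_hallmarks(element, left_flank, right_flank):
    """Check 5'-TC, CTAG-3' and an A|T insertion site (assumed flank layout)."""
    return (
        element.startswith("TC")
        and element.endswith("CTAG")
        and left_flank.endswith("A")
        and right_flank.startswith("T")
    )


def has_3prime_hairpin(element, stem=6, min_loop=3, max_loop=8, window=40):
    """Look for a perfect stem-loop in the last `window` bases.

    All size parameters are illustrative placeholders, not published values.
    """
    tail = element[-window:]
    for i in range(len(tail) - 2 * stem - min_loop + 1):
        target = revcomp(tail[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(tail):
                break
            if tail[j:j + stem] == target:
                return True
    return False


# Toy candidate: TC ... stem, loop, reverse-complement stem ... CTAG
candidate = "TC" + "AAAA" + "GGCACC" + "TTTT" + "GGTGCC" + "A" * 10 + "CTAG"
print(has_terminal_hallmarks(candidate, "GGGA", "TCCC"), has_3prime_hairpin(candidate))
```

A check of this kind is strict by design: a terminus that deviates from the expected end or lacks a clean hairpin is rejected, which is one way divergent elements can escape structure-based tools.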
HelitronFinder works optimally with maize but is hard to extend to other species, while HelSearch does not appear to have species limitations but requires manual inspection to identify 5′ ends. Both methods identified approximately 3,000 Helitrons in maize with a 95% overlap, but neither was able to detect a highly abundant ∼1-kb Helitron named Cornucopious, with thousands of copies in the maize genome, that had been identified earlier from a vertical comparison of allelic haplotypes. This failure was caused by Cornucopious having more divergent 3′ ends than previously known Helitrons. Another work combining BLAST search and hidden Markov models identified many Helitrons in the rice genome, but did not seem applicable to maize. A model-based method searched for new Helitron termini by BLASTing known Helitron terminal consensus sequences and identified a number of Helitrons in Arabidopsis thaliana. This method brought more flexibility than searching for whole homologous Helitrons, but was still limited to Helitron termini that are highly similar to known ones. Other ad hoc methods for Helitron identification in diverse genomes rely heavily on BLAST and manual annotation. De novo transposon identification algorithms like RECON and RepeatScout also depend on pair-wise genome BLAST and on high sequence similarity among copies of one transposon family in the host genome. Divergent Helitrons would be missed by these de novo methods because they do not align well. BLAST has been the most valuable weapon in the arsenal of bioinformatic analysis as a result of its power in finding sequence homology at given thresholds of statistical significance. However, there is no clear division in the spectrum of sequence similarity from being completely identical to not even remotely related.
Divergent sequences that evolved from one ancient ancestor may appear totally unrelated in BLAST output or manual inspection, yet they behave as one functional family or bear common features when they function. In other words, although homologous sequences always lead to common functions, functional resemblance does not guarantee global sequence similarity, at least not in an apparent manner. The difficulties in functional bioinformatics studies are, by and large, attributed to this inconsistency. Hurdles in previous Helitron identification also fall into this category because of the divergent nature of Helitrons and the lack of common transposon features like terminal inverted repeats and target site duplications. A position-specific scoring matrix (PSSM) is a more flexible representation of sequence patterns than a consensus sequence. Although successfully applied in various DNA binding site prediction studies, a PSSM requires a target region from a group of well-aligned sequences that are functionally related. Creating such a sequence profile for Helitrons would be difficult considering our current insufficient understanding of the Helitron transposition mechanism. In order to automate and generalize Helitron identification in various species, we developed a tool, HelitronScanner, using a local combinational variable (LCV) approach. LCVs were first extracted and refined from a training set compiled from previously published Helitrons. HelitronScanner searches for sequence patterns that match these LCVs. The significance of matches is measured with scores computed separately for the 5′ and 3′ ends of putative Helitrons and filtered with empirical thresholds. HelitronScanner identified a plethora of diverse Helitrons in many plant genomes, including those missed by previous methods, and thus should pave the way to a better understanding of the transposition mechanism of Helitrons and their evolutionary contribution to genome dynamics.
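To make the PSSM requirement concrete, here is a minimal sketch, assuming a small set of already aligned, equal-length sequences, which is exactly the precondition that is hard to meet for Helitrons. The function names, the pseudocount, and the uniform background frequency of 0.25 are illustrative assumptions and are not taken from HelitronScanner.

```python
import math


def build_pssm(aligned, pseudocount=1.0, background=0.25):
    """Log-odds PSSM from equal-length, pre-aligned DNA sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("a PSSM needs sequences aligned to the same length")
    total = len(aligned) + 4 * pseudocount
    pssm = []
    for i in range(length):
        column = [seq[i] for seq in aligned]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def pssm_score(pssm, window):
    """Sum of per-position log-odds for a window of the same length as the PSSM."""
    return sum(column[base] for column, base in zip(pssm, window))


# Toy alignment of four 3' termini (hypothetical data)
pssm = build_pssm(["CTAG", "CTAG", "CTGG", "CTAG"])
print(pssm_score(pssm, "CTAG") > pssm_score(pssm, "GATC"))
```

Every score is tied to a fixed column of the alignment, so a conserved nucleotide combination that shifts by a few bases between elements contributes nothing. That positional rigidity is what the LCV approach relaxes.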
The local combinational variable approach constitutes the key component of HelitronScanner. Compared to BLAST-derived sequence similarities, LCVs are more granular overrepresented patterns present at variable locations, not necessarily in line with the order of their original locations. How LCVs are combined in known Helitrons from the training set does not have to be the same as how they appear in new Helitrons, provided that putative Helitrons bear enough significance measured by the number of LCVs they contain. This relaxed constraint gives rise to the discovery of more divergent putative Helitrons that would otherwise be missed by BLAST or similar methods, while still demanding a certain degree of connection between known and predicted Helitrons. It is the LCVs that bridge functional resemblance and seemingly unrelated divergence on a whole-sequence level among Helitrons. We investigated the divergence of the 107,367 putative Helitrons identified by HelitronScanner from 39 plant genomes by clustering their 30-bp 3′-end sequences using the cd-hit program. Figure 1 shows hierarchical relationships among the top 50 clusters, which account for 39,554 Helitrons, with their respective sequence logos. Sequence similarity within clusters varies. For instance, Helitron termini are more homogeneous within cluster 32, cluster 44, cluster 46 and cluster 50 than within other clusters. Similarities among clusters are revealed by the inner dendrogram in Figure 1. Although the 3′-CTRR is not universal in all clusters, all clusters appear to be more conserved at the very 3′ terminus and in another region a few base pairs upstream from it, which probably reflects the known 3′-end hairpin structure existing in most Helitrons.
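The idea of scoring a candidate terminus by the number of LCVs it contains, wherever they occur, can be sketched as follows. This commentary does not spell out the exact LCV data structure, so the sketch assumes an LCV can be written as a tuple of (offset, nucleotide) pairs with gaps allowed between offsets. The example LCVs, the function names, and the threshold argument are hypothetical.

```python
def match_positions(seq, lcv):
    """Return every start position at which the gapped pattern `lcv` occurs in `seq`."""
    span = max(offset for offset, _ in lcv) + 1
    return [
        start
        for start in range(len(seq) - span + 1)
        if all(seq[start + offset] == base for offset, base in lcv)
    ]


def lcv_score(seq, lcvs):
    """Count how many distinct LCVs occur at least once, regardless of order or position."""
    return sum(1 for lcv in lcvs if match_positions(seq, lcv))


def passes(seq, lcvs, threshold):
    """Keep a candidate terminus only if it contains enough LCVs (empirical cutoff)."""
    return lcv_score(seq, lcvs) >= threshold


# Hypothetical LCVs: "C, T, any base, G" and "T, any base, G"
LCVS = [((0, "C"), (1, "T"), (3, "G")), ((0, "T"), (2, "G"))]

print(match_positions("AACTAGTT", LCVS[0]), match_positions("CTGGAA", LCVS[0]))
```

The same LCV matches at position 2 in one toy sequence and position 0 in the other, and the gapped third base differs between them. Neither sequence needs to align with the other for both to collect the same score.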

Figure 1.

HelitronScanner identified 107,367 putative Helitrons from 39 plant genomes. Their top 50 clusters of 30-bp 3′-end sequences include 39,554 Helitrons. Similarities of the clusters are shown by the inner dendrogram. Sequence logos of the clusters are shown in the outer ring.

In a host genome, Helitron copies can be almost identical or very divergent. The gradual sequence variation makes clustering Helitrons based on sequence similarities somewhat arbitrary in terms of chosen thresholds. Creating multiple sequence alignment profiles for each cluster of Helitrons is also affected by how Helitrons are clustered. The LCV approach does not require Helitron categorization before an exhaustive search for overrepresented sequence patterns in the training set. LCVs are retained during the search only if their frequency is higher than average or a preset threshold. We clustered Helitron 3′-terminal sequences from the training set and analyzed their connections to the extracted LCVs. It is natural that most LCVs are shared within clusters. Some highly frequent LCVs are even shared among many clusters. On the other hand, Helitrons in one cluster may have different sets of LCVs due to sequence variation within the cluster. Generally, LCVs do not coincide with Helitron clusters. As in Figure 2, we chose 20 Helitrons from each of the top 5 clusters (blue circles) in the training set and connected them with the LCVs (red circles) they contain. Only the 46 LCVs that are shared by fewer than 30% of Helitrons in the training set are depicted, to ensure better visualization. It can be seen from Figure 2 that these less frequent LCVs do not reside exclusively in one cluster, which complicates a clear categorization of Helitron families. Given the evolutionary distance revealed by Helitron terminal clustering, the LCVs most widely shared among clusters may represent nucleotide patterns that are conserved throughout evolution and are likely under selection pressure.
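The exhaustive search with frequency-based retention described above can be illustrated with a small sketch. It is an assumption-laden toy, not the HelitronScanner procedure: it enumerates every combination of `k` nucleotides within a local span of `span` bases, counts how many unaligned training sequences contain each combination, and keeps those at or above a preset fraction. The parameters `k`, `span`, and `min_fraction`, the function names, and the three training sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def enumerate_lcvs(seq, k=3, span=5):
    """All k-nucleotide combinations (k >= 2) found within `span` bases of each other."""
    found = set()
    for start in range(len(seq)):
        for rest in combinations(range(1, span), k - 1):
            if start + rest[-1] >= len(seq):
                continue  # combination would run past the end of the sequence
            offsets = (0,) + rest
            found.add(tuple((off, seq[start + off]) for off in offsets))
    return found


def extract_lcvs(training, k=3, span=5, min_fraction=0.5):
    """Keep combinations present in at least `min_fraction` of the training sequences."""
    counts = Counter()
    for seq in training:
        counts.update(enumerate_lcvs(seq, k, span))  # counted once per sequence
    cutoff = min_fraction * len(training)
    return {lcv: n for lcv, n in counts.items() if n >= cutoff}


# Toy, unaligned training termini (hypothetical data)
training = ["AACTAGTT", "CTGGAACC", "GGCTCGAA"]
retained = extract_lcvs(training, min_fraction=1.0)
print(retained[((0, "C"), (1, "T"), (3, "G"))])
```

No alignment or prior grouping of the training sequences is needed: the combination "C, T, any base, G" is found in all three toy sequences even though it starts at a different position in each.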

Figure 2.

Connections of less frequent LCVs to Helitrons in the training set, which are clustered based on their 3′-end sequences. The top 5 clusters, each including 20 selected Helitrons (blue circles), are connected with the 46 less frequent LCVs (red circles) they contain. These LCVs are shared by fewer than 30% of Helitrons in the training set. More frequent LCVs are not shown here to ensure better visualization.

The LCV approach does not require prior knowledge of how training sequences should be aligned or which regions are of interest, which is especially useful when experimental data are not available. We tried to extract LCVs without assumptions about regions of interest and, after testing 200-bp Helitron terminal sequences and 100-bp insertion sites, found that LCVs reside only within 50 bp of both termini. Most Helitrons share the 5′-TC and 3′-CTRR hallmarks at their termini. The LCVs are a fine-grained representation of favorable combinations of nucleotides in divergent Helitrons. In contrast, BLAST essentially detects larger-scale sequence homology. Figure 3 shows the distribution of 303 and 575 LCVs from Helitron 5′ and 3′ ends, respectively, over 11 representative Helitron terminal sequences from the training set. Nucleotides are colored in red if they match LCVs in the 50-bp region of Helitron 5′ (Figure 3A) and 3′ (Figure 3B) ends. The color saturation of each nucleotide is proportional to the number of LCVs it matches. The blue regions at the very termini are the known 5′-TC and 3′-CTRR Helitron hallmarks. One LCV may reside at variable locations in different Helitrons and may have gaps between the conserved nucleotides. Uncolored nucleotides flanked by colored ones in Figure 3 are gaps in LCVs. That the gapped nucleotides are less conserved and more susceptible to mutation suggests they are not as functionally crucial as the conserved nucleotides. The length of the colored region also varies among terminal sequences. Helitrons with potential multiple ends are expected to have a longer range of matched LCVs.
Different numbers of matched LCVs in Helitron termini indicate a location-specific weight of conserved nucleotides, as the color patterns in Figure 3A and 3B demonstrate. Histograms of LCV abundance at Helitron 5′ (Figure 3C) and 3′ (Figure 3D) ends show the distribution of overall LCV weight at each location, contributed by all LCVs in all Helitrons from the training set. The 9th and 15th nucleotides from the 5′ and 3′ ends, respectively, appear to be overall the most conserved locations in all Helitron termini. LCVs at Helitron 3′ ends are mostly concentrated within a 30-bp range while LCVs at 5′ ends spread more broadly, which makes it harder in practice to detect Helitron 5′ ends than 3′ ends.
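The per-nucleotide weights and the accumulated histograms described above amount to simple counting. The sketch below is illustrative only: it again assumes that an LCV is a tuple of (offset, nucleotide) pairs, and that the terminal windows being summed have equal length and are anchored at the same terminus. The LCVs and sequences are hypothetical.

```python
def position_weights(seq, lcvs):
    """For each nucleotide, count the LCV matches that use it (gap positions get nothing)."""
    weights = [0] * len(seq)
    for lcv in lcvs:
        span = max(offset for offset, _ in lcv) + 1
        for start in range(len(seq) - span + 1):
            if all(seq[start + offset] == base for offset, base in lcv):
                for offset, _ in lcv:
                    weights[start + offset] += 1
    return weights


def accumulated_weights(windows, lcvs):
    """Sum per-position weights over equal-length terminal windows (histogram heights)."""
    length = len(windows[0])
    totals = [0] * length
    for seq in windows:
        if len(seq) != length:
            raise ValueError("terminal windows must have equal length")
        for i, weight in enumerate(position_weights(seq, lcvs)):
            totals[i] += weight
    return totals


# Hypothetical LCVs: "C, T, any base, G" and "T, any base, G"
LCVS = [((0, "C"), (1, "T"), (3, "G")), ((0, "T"), (2, "G"))]

print(position_weights("AACTAGTT", LCVS))
print(accumulated_weights(["AACTAGTT", "CTGGAACC"], LCVS))
```

In the first toy sequence the nucleotide at index 4 sits between two weighted positions but receives no weight itself, which corresponds to an uncolored gap flanked by colored nucleotides.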

Figure 3.

LCV variation and accumulated weight. LCV distribution in Helitron 5′ (A) and 3′ (B) ends is depicted by nucleotides colored in red. Color saturation is proportional to the number of LCVs each nucleotide matches. The invariant 5′-TC and 3′-CTAG Helitron hallmarks are colored in blue. Histograms of accumulated numbers of matched LCVs in Helitron 5′ (C) and 3′ (D) ends show variation in conserved terminal regions.

The large cache of overlooked Helitrons uncovered by HelitronScanner is a great resource to the research community for further study of the active roles Helitrons have played in genome dynamics. We are currently working on functional annotation and comparative analysis of the Helitrons newly identified by HelitronScanner.

Conclusion

As the key component of HelitronScanner, the LCV approach extracts granular conserved information (LCVs) at variable locations from unaligned Helitron sequences, and identifies Helitrons based on the number of LCVs they match. HelitronScanner outperformed previous methods by using LCVs collectively as definitive Helitron features in addition to the known hallmarks. A large number of divergent Helitrons in many plant species were uncovered, which will be a great resource for the research community. The results indicate that the LCV approach is more sensitive to highly divergent Helitrons than previous sequence alignment methods. Analysis of the overrepresented and conserved LCVs over different groups of Helitrons may help provide insights into the evolutionary trajectory of this unusual transposon superfamily.
